This project started with an analysis of different programming language subreddits, and since it was (mostly ^_-) received well, I decided to look into some other communities too. The result is the webpage you are reading right now.
All submission IDs of a subreddit inside the mentioned time period are obtained by using queries like the ones generated by the reddit time machine. Then the comments are loaded via PRAW and put into a SQLite database, which is searched with regular expressions. The static diagrams are generated with matplotlib and the interactive ones use D3.js. Finally, this blog is compiled from markdown by Hakyll.
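To give an idea of how the "SQLite searched with regular expressions" step can look, here is a minimal sketch. It is not the actual code behind this page: the table layout (`comments` with `id`, `subreddit`, `body`), the sample rows and the helper `count_matching` are all made up for illustration. SQLite has no built-in regex engine, so the sketch registers a Python function as the `REGEXP` operator.

```python
import re
import sqlite3


def regexp(pattern, text):
    # SQLite evaluates "X REGEXP Y" as regexp(Y, X): pattern first, then the text.
    if text is None:
        return 0
    return 1 if re.search(pattern, text, re.IGNORECASE) else 0


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)

# Hypothetical schema and sample data, only here so the example runs on its own.
conn.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, subreddit TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        ("c1", "python", "I hate semicolons"),
        ("c2", "python", "Whatever works"),
        ("c3", "haskell", "I hate nothing, hatefully"),
    ],
)


def count_matching(conn, subreddit, pattern):
    # Number of comments in one subreddit whose body matches the regex.
    row = conn.execute(
        "SELECT COUNT(*) FROM comments WHERE subreddit = ? AND body REGEXP ?",
        (subreddit, pattern),
    ).fetchone()
    return row[0]


python_hits = count_matching(conn, "python", r"\bhate\b")
haskell_hits = count_matching(conn, "haskell", r"\bhate\b")
```

The `\b` word boundaries keep ‘hatefully’ from counting as ‘hate’, which is the kind of detail plain substring search would get wrong.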
Why did you choose this metric?
comments_containing_word / 10000 comments
could also have word_count in its numerator, but then a single comment that repeats a word very often could distort the numbers.
The denominator could be sum_of_characters_in_all_comments to take comment lengths into account, but then a single very long comment with copy-pasted articles, source code, raw data etc. could skew the result very much.
So I hope the chosen metric is the least unstable one.
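A small sketch of the difference, assuming comments are plain strings and a word is matched case-insensitively at word boundaries. The function names and the toy data are invented for this example and are not the code used for the diagrams.

```python
import re


def _word_pattern(word):
    # Whole-word, case-insensitive match for a literal word.
    return re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)


def comments_containing_word_per_10k(comments, word):
    # The metric used here: each comment counts at most once.
    if not comments:
        return 0.0
    pattern = _word_pattern(word)
    hits = sum(1 for body in comments if pattern.search(body))
    return hits * 10000 / len(comments)


def word_count_per_10k(comments, word):
    # The rejected alternative: every occurrence counts.
    if not comments:
        return 0.0
    pattern = _word_pattern(word)
    total = sum(len(pattern.findall(body)) for body in comments)
    return total * 10000 / len(comments)


# Toy data: one normal comment, one comment repeating the word 50 times,
# and 9998 comments that do not contain it at all.
comments = ["I hate this", "hate " * 50] + ["nothing to see here"] * 9998

by_comment = comments_containing_word_per_10k(comments, "hate")
by_occurrence = word_count_per_10k(comments, "hate")
```

With this toy data the per-comment metric counts two comments, while the per-occurrence variant counts 51 words, almost all of them from one single comment.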
Can you take negations into account?
Parsing the meaning of human language is (at least for me) very difficult. Finding negations takes more than putting ‘not’ in front of the word in the search query. But if anybody knows how to do something like that reliably, please contact me. :)
Since when is ‘hate’ a swear word?
The ‘cursing’ section mainly shows expressions of negative sentiment. But since a title saying that felt too long, I kept the simpler, yet inaccurate, one.