
Saturday, December 25, 2010

NGRAM VIEWER from Google. Working with words and an immense corpus in different languages. A dream come true

A Very Data Christmas

This week Google announced its Ngram Viewer, which allows you to explore the use of words in thousands of texts over time, going back two hundred years. Given the relatively long time period covered by this massive data set, it is fun to explore how language has changed over time.
Some texts, however, seem to transcend time. Christmas carols are one great example, as many of them have origins dating back hundreds of years. As a holiday gift to ZIA readers I thought it would be fun to explore the lyrics of Christmas carols, and see how the word usage in these songs compares with today’s lexicon. To do so I needed two things: first, Christmas carol texts; and second, a way to compare the usage of words in those songs to that of today.
A simple Google search for Christmas carol lyrics yielded this site, which I downloaded into a single text file. Then, I used the R tm package to create a clean word corpus from this text, converting it to lowercase and stripping out English stopwords and punctuation. This left me with 755 words to explore.
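The cleaning step described above can be sketched in a few lines. This is a rough Python stand-in, not the tm code used for the post: the function name `clean_words`, the tiny `STOPWORDS` set, and the sample lyric are all assumptions made for illustration (tm ships a much longer English stopword list).

```python
import string
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "on", "all"}

def clean_words(text, stopwords=STOPWORDS):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    lowered = text.lower()
    # Replace each punctuation character with a space so "night," becomes "night".
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = lowered.translate(table).split()
    return [t for t in tokens if t not in stopwords]

# Sample line for illustration only; the post used a file of 25 carols.
sample = "Silent night, holy night! All is calm, all is bright."
words = clean_words(sample)
counts = Counter(words)  # word -> number of occurrences
```

The resulting `counts` object maps each remaining word to its frequency, which is the kind of table the rest of the analysis works from.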
The above figure shows the frequency of the most popular words from the 25 carols in the corpus. In this case, “most popular” is defined as a frequency of 10 or more, which amounts to 42 words. Perhaps unsurprisingly, words related to the birth of Jesus Christ are by far the most popular, followed by those with a general celestial theme. While this is interesting, it is only half of the exploration. To put these lyrics in context I used the Infochimps API to get the usage statistics for all of these words on Twitter.
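The "frequency of 10 or more" cutoff is easy to express as a filter over a word-count table. The sketch below is Python rather than the R used for the post, and the helper name `popular_words` and the toy counts are made up for illustration; they are not the figures from the carol corpus.

```python
from collections import Counter

def popular_words(counts, min_count=10):
    """Return (word, count) pairs with count >= min_count, most frequent first."""
    kept = [(w, c) for w, c in counts.items() if c >= min_count]
    # Sort by descending count, breaking ties alphabetically.
    return sorted(kept, key=lambda wc: (-wc[1], wc[0]))

# Made-up counts for illustration only; these are not the post's data.
toy_counts = Counter({"christ": 14, "born": 11, "night": 10, "snow": 3})
top = popular_words(toy_counts)
```

Lowering `min_count` to 5 would give the looser cutoff used later for the scatter plot.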
For consistency, I have extracted the 42 carol words most frequently used on Twitter from the full data set. In keeping with the spirit of Christmas, “love” is the most popular word on Twitter. The others seem to make sense, as they include several temporal words, but I must admit I did get a chuckle seeing the popularity of “ass” given this context.
Finally, to compare the usage in the carols to people’s tweets I generated a simple scatter plot. On the x-axis is the Twitter usage in log-scale, on the y-axis the frequency of words in carols, and each word is colored from red to green based on the normalized number of unique users that have used the term, with green corresponding to more users. To reduce clutter I restricted this plot to words that appeared in carols five or more times.
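For readers who want to see how the plot's inputs fit together, here is a minimal Python sketch of the data preparation only. It assumes min-max scaling for the "normalized" user counts and a straight linear blend from red to green; the post does not say which normalization or color ramp was used. The rows are hypothetical numbers, not the Infochimps data.

```python
import math

def min_max_normalize(values):
    """Rescale values to [0, 1]; an all-equal input maps to 0.0 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def red_to_green(t):
    """Map t in [0, 1] to an (r, g, b) tuple: 0 is pure red, 1 is pure green."""
    return (1.0 - t, t, 0.0)

# Hypothetical rows: (word, twitter_count, unique_users, carol_count).
rows = [
    ("love", 1_000_000, 50_000, 6),
    ("night", 100_000, 20_000, 12),
    ("excelsis", 100, 10, 5),
    ("manger", 1_000, 200, 4),
]

# Keep only words that appear in the carols five or more times.
kept = [r for r in rows if r[3] >= 5]

norm_users = min_max_normalize([r[2] for r in kept])
points = [
    {"word": w, "x": math.log10(tw), "y": carols, "color": red_to_green(t)}
    for (w, tw, _users, carols), t in zip(kept, norm_users)
]
```

Each entry in `points` holds a log-scaled x value, the carol frequency as y, and an RGB color, which is everything a plotting library needs to draw one labeled dot per word.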
What’s most interesting (to me) is that there is roughly a linear relationship for some words, starting with “excelsis” in the lower-left, and moving up through “christ,” “born,” “king,” and “night.” There are, however, many more words that do not fit this pattern, and in fact, the most popular words from the sample on Twitter are relatively infrequent in the carols, such as “love” and “day.”
The code and data for all of this are available on my GitHub, and I hope you can take some time out from singing to your neighbors to play around with it.
Happy holidays to all, and a very merry (data) Christmas. See you in 2011!
