Culturomics

Several prominent scholars and the Google Books team have published a new paper that’s generating a lot of buzz (Google pun intended). The paper is in Science (available here), and (update) here’s the abstract:

We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of "culturomics", focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. "Culturomics" extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

It’s readable, thought-provoking, and touches on many fields of study, so I imagine it will be widely read and cited. Others have noted many of the highlights, so here are some brief bulleted thoughts:

  • The authors don’t explore the possible selection bias in their study. They note that the corpus of books they studied includes about 4% of all books ever printed. They specifically chose works that scanned better and had better metadata (author, date of publication, etc.), so it seems quite likely that these works differ systematically from those that were scanned but not chosen, and differ even more from those not yet scanned. Will the conclusions hold up when new books are added? Since many of the results were based on random subsets of the books (or n-grams) they studied, will those results hold up when other scholars try to recreate them with separate randomly chosen subsets? (The first sketch after this list shows what such a replication check might look like.)
  • Speaking of metadata, I would love to see an analysis of social networks amongst authors and how that impacts word usage. If someone had a listing of, say, several hundred authors from one time period, plus some measure of how well they knew each other, and combined that information with an analysis of their works, they might get some sense of how “original” various authors were, and whether originality is even important in becoming famous.
  • The authors are obviously going for a big splash and make several statements that are a bit provocative and likely to be quoted. It will be great to see these challenged and discussed in subsequent publications. One example that is quotable but may not be fully supported by the data they present: “We are forgetting our past faster with each passing year.” But is the frequency with which a year (their example is 1951) appears in books actually representative of collective forgetting?
  • I love the wordplay. An example: “For instance, we found ‘found’ (frequency: 5x10^-4) 200,000 times more often than we finded ‘finded.’ In contrast, ‘dwelt’ (frequency: 1x10^-5) dwelt in our data only 60 times as often as ‘dwelled’ dwelled.” (The arithmetic implied by those numbers is worked out in the second sketch after this list.)
  • The “n-grams” studied here (strings of characters separated from their neighbors by spaces, which could be words, numbers, or typos, taken singly or in short sequences) are too short for a study of copying and plagiarism, but similar approaches could yield insight into how common copying or borrowing has been throughout history. (The last sketch after this list shows the basic extraction.)
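
A minimal sketch of the replication check raised in the first bullet, in Python. Everything here is fabricated for illustration: the per-book counts are random stand-ins and the 50/50 split is my choice; none of it reflects the authors’ actual pipeline.

```python
import random
from collections import Counter

# Fabricated stand-in data: a Counter of 1-gram counts for each "book".
# A real check would build these from the scanned corpus and its metadata.
books = [Counter({"found": random.randint(0, 9),
                  "the": random.randint(50, 150)})
         for _ in range(1000)]

def relative_frequency(subset, word):
    """Share of all 1-gram tokens in `subset` that are `word`."""
    word_count = sum(book[word] for book in subset)
    total_tokens = sum(sum(book.values()) for book in subset)
    return word_count / total_tokens

# Two disjoint random halves of the corpus, estimated independently.
shuffled = random.sample(books, len(books))
half_a, half_b = shuffled[:500], shuffled[500:]
print(relative_frequency(half_a, "found"))
print(relative_frequency(half_b, "found"))
# Close agreement across many such splits would support the published
# trends; systematic disagreement would suggest the choice of subset matters.
```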
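
The wordplay quote also invites a quick back-of-the-envelope check. This sketch only does the arithmetic implied by the passage; the frequencies and ratios are the paper’s, the variable names are mine.

```python
# Values quoted in the paper's example.
freq_found = 5e-4        # frequency of "found"
ratio_found = 200_000    # "found" vs. "finded"
freq_finded = freq_found / ratio_found
print(f"implied frequency of 'finded': {freq_finded:.1e}")    # 2.5e-09

freq_dwelt = 1e-5        # frequency of "dwelt"
ratio_dwelt = 60         # "dwelt" vs. "dwelled"
freq_dwelled = freq_dwelt / ratio_dwelt
print(f"implied frequency of 'dwelled': {freq_dwelled:.1e}")  # 1.7e-07
```

The implied numbers make the contrast concrete: “dwelled” comes out roughly 70 times more common than “finded,” which is the point of the comparison: “dwelled” is a live competitor to “dwelt,” while “finded” is not a word anyone uses.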
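
And for the last bullet, here is what the paper’s n-gram definition looks like in code: a 1-gram is any run of characters uninterrupted by a space, and an n-gram is a sequence of n of them (the published data goes up to 5-grams). The sample sentence is mine.

```python
def ngrams(text, n):
    """Split on whitespace into 1-grams (words, numbers, and typos alike),
    then slide a window of n tokens across the result."""
    tokens = text.split()  # 1-grams: any run of characters between spaces
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "we found found 200,000 times more often than we finded finded"
print(ngrams(sentence, 1))  # individual tokens
print(ngrams(sentence, 5))  # the longest sequences in the published data
```

Detecting copying would require much longer shared spans than five tokens, but the same windowing idea extends directly.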