Monthly Archives: August 2012

Data mining the classics makes for beautiful science





Jockers, Matthew, Stanford University, USA



Whether consciously influenced by a predecessor or not, it might be argued that every book is in some sense a necessary descendant of, or necessarily connected to, those before it. Influence may be direct, as when a writer models his or her writing on another writer, or influence may be indirect in the form of unconscious borrowing. Influence may even be oppositional as in the case of a writer who wishes to make his or her writing intentionally different from that of a predecessor. The aforementioned thinkers offer informed but anecdotal evidence in support of their claims of influence. My research brings a complementary quantitative and macroanalytic dimension to the discussion of influence. For this, I employ the tools and techniques of stylometry, corpus linguistics, machine learning, and network analysis to measure influence in a corpus of late 18th- and 19th-century novels.




The 3,592 books in my corpus span from 1780 to 1900 and were written by authors from Britain, Ireland, and America; the corpus is almost even in terms of gender representation. From each of these books, I extracted stylistic information using techniques similar to those employed in authorship attribution analysis: the relative frequencies of every word and mark of punctuation are calculated and the resulting data winnowed so as to exclude features not meeting a preset relative frequency threshold. From each book I also extracted thematic (or topical) information using Latent Dirichlet Allocation (Blei, Ng et al. 2003; Blei, Griffiths et al. 2004; Chang, Boyd-Graber et al. 2009). The thematic data includes information about the percentages of each theme/topic found in each text. I combine these two categories of data – stylistic and thematic – to create book signals composed of 592 unique feature measurements. The Euclidean metric is then used to calculate every book’s distance from every other book in the corpus. The result is a distance matrix of dimension 3,592 x 3,592.
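To make the pipeline in the paragraph above concrete, here is a minimal Python sketch of the stylistic half plus the distance step. It is not Jockers's code. The tokenizer, the function names, and the `min_mean_freq` threshold are all assumptions for illustration, and the per-book topic proportions are simply passed in as if an LDA model had already produced them.

```python
import math
import re
from collections import Counter


def relative_frequencies(text):
    """Relative frequency of every word and punctuation mark in one text."""
    # Assumed tokenizer: runs of letters/apostrophes, or single punctuation marks.
    tokens = re.findall(r"[a-z']+|[^\w\s]", text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def build_signals(texts, topic_proportions, min_mean_freq=0.05):
    """Combine thresholded stylistic features with topic proportions.

    topic_proportions is a hypothetical input: one list of topic shares
    per book, as an LDA model might produce. min_mean_freq is an
    illustrative cutoff, not the threshold used in the original study.
    """
    freqs = [relative_frequencies(t) for t in texts]
    vocab = sorted(set().union(*freqs))
    # Keep only features whose mean relative frequency meets the threshold.
    kept = [
        tok for tok in vocab
        if sum(f.get(tok, 0.0) for f in freqs) / len(freqs) >= min_mean_freq
    ]
    return [
        [f.get(tok, 0.0) for tok in kept] + list(topics)
        for f, topics in zip(freqs, topic_proportions)
    ]


def distance_matrix(signals):
    """Euclidean distance between every pair of book signals."""
    return [[math.dist(a, b) for b in signals] for a in signals]


# Toy corpus standing in for the 3,592 novels.
texts = ["the cat, the dog.", "the cat, the dog.", "a bird sang!"]
topics = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
signals = build_signals(texts, topics)
distances = distance_matrix(signals)
```

With n books the result is an n x n symmetric matrix with zeros on the diagonal, which is the same shape of object as the 3,592 x 3,592 matrix described above.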

While measuring and tracking actual or true influence – conscious or unconscious – is impossible, it is possible to use the stylistic-thematic distance/similarity measurements as a proxy for influence. Network visualization software can then be used as a way to organize, visualize, and study the presence of influence among the books in my corpus.
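One way a distance matrix could be turned into something network software can read is an edge list. The abstract does not say how edges were chosen, so the rule below (link each book to its `k` closest neighbors) is purely an assumption, as are the function name and the toy labels.

```python
def nearest_neighbor_edges(distances, labels, k=2):
    """Link each book to its k closest books (an assumed edge rule).

    Returns (source, target, distance) tuples; a smaller distance is
    read here as greater stylistic-thematic similarity.
    """
    edges = []
    for i, row in enumerate(distances):
        # Rank every other book by its distance from book i.
        others = sorted((d, j) for j, d in enumerate(row) if j != i)
        for d, j in others[:k]:
            edges.append((labels[i], labels[j], d))
    return edges


# Toy 3 x 3 distance matrix with hypothetical book labels.
toy_distances = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
toy_edges = nearest_neighbor_edges(toy_distances, ["A", "B", "C"], k=1)
for source, target, dist in toy_edges:
    print(source, target, dist)
```

Each tuple can be written out as a row of a CSV edge list, a format most network visualization tools can import.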


via Data mining the classics makes for beautiful science – Future of Tech on

This is a really interesting measure of literary “influence.” My project indicates influence in a different fashion, by tracing the relations between words or tropes in literary circles.


Filed under Digital Humanities