Tag Archives: data mining

Spanish Flu Project at Virginia Tech via Chronicle of Higher Education


Soldiers with the Spanish flu are hospitalized inside the U. of Kentucky gym in 1918. In one prevention method examined in a new study, New Yorkers were advised to refrain from kissing “except through a handkerchief.” – via the Chronicle of Higher Education

The Chronicle of Higher Education (February 27, 2015) contains an article by Jennifer Howard on the Spanish Flu Project: a big data project funded by the NEH (among other entities) exploring reporting on the 1918 Spanish Flu. As Howard describes the research:

The team began with several questions: How did reporting on the Spanish flu spread in 1918? And how big a role did one influential person play in shaping how the outbreak was handled? . . . Royal S. Copeland was the health commissioner of New York City in August 1918, when a ship arrived in New York Harbor from Europe with flu victims aboard . . . . Copeland helped set the tone for how the nation reacted to a viral threat—and has been the subject of debate among historians ever since, with competing camps arguing about whether he did enough.

Researchers would typically scour public statements by Copeland to answer these questions. But since the outbreak was “well documented in the popular press of the day,” it seemed an ideal topic for “digitally enabled scholarship.”

Using the Library of Congress’s Chronicling America database of historical newspapers, the HathiTrust Digital Library, and other sources, the Virginia Tech researchers sought out direct and indirect evidence of Copeland’s role: mentions and quotations, references to flu-containment strategies he promoted. “You can see his influence even if his name’s not used,” Mr. Ewing says.

The article does a good job highlighting the strengths and weaknesses of this form of digital scholarship. As Howard notes, this complex project requires both “code and context”:

To produce useful results, this kind of investigation depends on customized algorithms. But coming up with a good algorithm involves both code and context, a mingling of the complementary strengths of computer scientists and humanists . . . . The hybrid, trial-and-error nature of the Spanish-flu investigation may say something about the current state of computer-assisted humanities work. Mr. Bobley of the NEH says he has been impressed with the flu researchers’ “candid thoughts on how computational approaches like data mining are no magic bullet,” even as they expand what humanists can do. The work is a reminder, he says, that “historical documents like newspapers are rich, messy, nuanced, and complex documents that defy easy computational analysis.”

Leave a comment

Filed under Digital Humanities

SLATE’s The Vault: Five of 2014’s Most Compelling Digital History Exhibits and Archives

From Slate‘s history blog, The Vault, Rebecca Onion features five digital collections and/or historical websites:

2014 brought us a wealth of new digital archives and document-rich historical websites to peruse. Here, in no particular order, are five of the best such sites I saw this year.

Follow the link to enjoy. She promises a link to five more sites tomorrow.

Historical documents online: Five best digital archives from 2014.

Leave a comment

Filed under Digital Collections, Digital Humanities

CFP: Data Driven: Digital Humanities in the Library | HASTAC

Call For Conference Proposals

Data Driven: Digital Humanities in the Library

June 20-22, 2014, Charleston, SC

Guidelines for Submission
Lightning Round/Paper/Panel deadline: 01 December 2013
Workshop proposal deadline: 01 February 2014

General Information

“Data Driven: Digital Humanities in the Library,” sponsored by the South Carolina Digital Library, the College of Charleston and the Charleston Conference, invites submissions for its 2014 conference, on all aspects of digital humanities in the library. This includes but is not limited to:

  • Digital scholarship
  • Humanities & library collaborations on DH projects
  • GIS and/or data visualization projects
  • Text mining & data analysis
  • Digital humanities librarianship
  • Digital project management
  • Knowledge lifecycle, including production & collaboration
  • Creating or using tools & services for the production, editing and/or analysis of DH data
  • Metadata and linked data in DH

We particularly welcome collaborative panel and paper submissions from teams of librarians and humanities scholars and/or graduate students. We strongly encourage proposals relating to the theme of the conference.

CFP: Data Driven: Digital Humanities in the Library | HASTAC.


Leave a comment

Filed under Digital Collections, Digital Humanities, Library science, workshop/conference

Big data meets the Bard – FT.com

Big data meets the Bard – FT.com.

Yet another article about the perhaps “diabolical” use of “Big Data” in the humanities. The article describes the author’s reactions to a Skype seminar from the Stanford Literary Lab. While I don’t think that “Big Data” will replace actually reading novels, I did cringe at this quote:

Ryan Heuser, 27-year-old associate director for research at the Literary Lab, tells me he can’t remember the last time he read a novel. “It was probably a few years ago and it was probably a sci-fi. But I don’t think I’ve read any fiction since I’ve been involved with the lab.”

But reading books and analyzing Big Data, as I’ve said before on this blog, are different, and complementary, tasks.

Leave a comment

Filed under Digital Collections, Digital Humanities, liberal arts colleges

Data mining the classics makes for beautiful science –





Jockers, Matthew, Stanford University, USA, mjockers@stanford.edu



Whether consciously influenced by a predecessor or not, it might be argued that every book is in some sense a necessary descendant of, or necessarily connected to, those before it. Influence may be direct, as when a writer models his or her writing on another writer, or influence may be indirect in the form of unconscious borrowing. Influence may even be oppositional, as in the case of a writer who wishes to make his or her writing intentionally different from that of a predecessor. The aforementioned thinkers offer informed but anecdotal evidence in support of their claims of influence. My research brings a complementary quantitative and macroanalytic dimension to the discussion of influence. For this, I employ the tools and techniques of stylometry, corpus linguistics, machine learning, and network analysis to measure influence in a corpus of late 18th- and 19th-century novels.




The 3,592 books in my corpus span from 1780 to 1900 and were written by authors from Britain, Ireland, and America; the corpus is almost even in terms of gender representation. From each of these books, I extracted stylistic information using techniques similar to those employed in authorship attribution analysis: the relative frequencies of every word and mark of punctuation are calculated and the resulting data winnowed so as to exclude features not meeting a preset relative frequency threshold. From each book I also extracted thematic (or topical) information using Latent Dirichlet Allocation (Blei, Ng et al. 2003; Blei, Griffiths et al. 2004; Chang, Boyd-Graber et al. 2009). The thematic data includes information about the percentages of each theme/topic found in each text. I combine these two categories of data – stylistic and thematic – to create book signals composed of 592 unique feature measurements. The Euclidean metric is then used to calculate every book’s distance from every other book in the corpus. The result is a distance matrix of dimension 3,592 x 3,592.
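The pipeline Jockers describes – concatenating stylistic frequencies with topic proportions into a fixed-length "book signal," then computing all pairwise Euclidean distances – can be sketched in a few lines. This is a minimal illustration with invented random data, not his actual code; the array sizes (400 stylistic features plus 192 topics, summing to the 592 features mentioned above) are assumptions chosen to match the description.

```python
import numpy as np

# Toy stand-in for the corpus: each book gets a fixed-length feature
# vector concatenating stylistic measurements (relative frequencies of
# frequent words/punctuation) with thematic proportions from a topic
# model. All names and shapes here are invented for the sketch.

rng = np.random.default_rng(0)

n_books = 10       # the real corpus has 3,592 books; 10 keeps the demo small
n_style = 400      # winnowed word/punctuation frequency features (assumed split)
n_topics = 192     # topic proportions; 400 + 192 = 592 features per book

style = rng.random((n_books, n_style))
topics = rng.random((n_books, n_topics))
topics /= topics.sum(axis=1, keepdims=True)   # topic mixtures sum to 1

signals = np.hstack([style, topics])          # one 592-dim "book signal" per book

# Pairwise Euclidean distance matrix (n_books x n_books)
diff = signals[:, None, :] - signals[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

assert dist.shape == (n_books, n_books)
assert np.allclose(np.diag(dist), 0.0)        # every book is distance 0 from itself
```

On the full corpus this yields the 3,592 x 3,592 matrix described above, where small distances mark stylistically and thematically similar books.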

While measuring and tracking actual or true influence – conscious or unconscious – is impossible, it is possible to use the stylistic-thematic distance/similarity measurements as a proxy for influence. Network visualization software can then be used as a way to organize, visualize, and study the presence of influence among the books in my corpus.
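Turning such a distance matrix into a network is conceptually simple: books become nodes, and an edge connects any pair whose distance falls below some cutoff. A minimal sketch, with invented data and an assumed percentile-based cutoff (the abstract does not specify how edges are thresholded):

```python
import numpy as np

# Toy "book signals" and their pairwise Euclidean distances.
rng = np.random.default_rng(1)
points = rng.random((6, 4))
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)

# Keep only the closest pairs as edges (25th-percentile cutoff is an
# arbitrary choice for this illustration).
cutoff = np.percentile(dist[dist > 0], 25)
edges = [(i, j) for i in range(len(dist))
         for j in range(i + 1, len(dist))
         if dist[i, j] < cutoff]

# `edges` can then be handed to network software (e.g. Gephi or
# networkx) for visualization and community analysis.
print(edges)
```

The resulting graph is what makes the proxy-for-influence argument visible: clusters of closely connected books suggest shared stylistic-thematic lineages.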


via Data mining the classics makes for beautiful science – Future of Tech on NBCNews.com.

This is a really interesting measure of literary “influence.” My project traces influence in a different fashion, through the relations between words or tropes in literary circles.

Leave a comment

Filed under Digital Humanities

The Digital Humanities and Interpretation – NYTimes.com

You might wonder, for example, what place or location names appear in American literary texts published in 1851, and you devise a program that will tell you. You will then have data.

But what do you do with the data?

The example is not a hypothetical one. It is put forward by Matthew Wilkens in his essay “Canons, Close Reading, and the Evolution of Method” (“Debates in the Digital Humanities,” ed. Matthew Gold, 2012). And Wilkens does do something with the data. He notices that “there are more international locations than one might have expected” — digital humanists love to be surprised because surprise at what has been turned up is a vindication of the computer’s ability to go beyond human reading — and from this he concludes that “American fiction in the mid-nineteenth century appears to be pretty diversely outward looking in a way that hasn’t received much attention.”

More international locations named than we would have anticipated; therefore mid-19th century American fiction is outward-looking, a fact we would not have “discovered” were it not for the kind of attention a computer, as opposed to a human reader, is capable of paying . . .

But does the data point inescapably in that direction? Don’t we have to know in what novelistic situations foreign lands are alluded to and by whom? If the international place names are invoked by a narrator, it might be with the intention not of embracing a cosmopolitan, outward perspective, but of pushing it away: yes, I know that there is a great big world out there, but I am going to focus in on a landscape more insular and American. If a character keeps dropping the names of towns and cities in Europe, Africa and Asia, the novelist could be alerting us to his pretentiousness and admonishing the reader to stay close to home. If a more sympathetic character daydreams about Paris, Istanbul and Moscow, she might be understood as caressing the exotic names in rueful recognition of the experiences she will never have.

The list of possible contextual framings is infinite, but some contextual framing is necessary if we are to move from noticing the naming of international locations to the assigning of significance. Otherwise we are asserting, without justification, a correlation between a formal feature the computer program just happened to uncover and a significance that has simply been declared, not argued for. (Frequency is not an argument.) Don’t we have to actually read the books, before saying what the patterns discovered in them mean?


via The Digital Humanities and Interpretation – NYTimes.com.

I agree with Fish that “data mining” as he describes it is inadequate to the task of interpretation as he defines it. However, data mining is not meant to (nor can it) provide information about something as abstract as “intent”; it is a way of looking for, and at, patterns in data sets that are too large for human comprehension, patterns that address larger issues than the author’s “intentionality.” Separating signal from “noise” is precisely what such algorithms are designed to accomplish, and the patterns discovered have less to do with the “intentionality” of an author’s writing than with the circulation of ideas, words, and concepts throughout a historical period or geo-spatial context.

I think the example he cites about “more international locations named than we would have anticipated” in American fiction is precisely the sort of data that is interesting. Fish argues that we can’t determine the “direction” of that international interest for any particular text without close reading, and he questions whether the directionality of the interest can be meaningfully ascertained from the aggregate data. But “outward-looking” means something different in a geo-political or geo-spatial analysis than in a close reading. This is an analysis that moves away from individual authority and intentionality, from the text as an artifact of a particular human intelligence, and looks instead at the cultural field of transatlantic influence in a particular historical, cultural, and geographic moment. Mid-19th-century American texts have a larger boundary or horizon than British texts, perhaps, and this is one way to measure it.

Data needs interpretation, yes; but close reading is not the only interpretive strategy out there.



Leave a comment

Filed under Digital Humanities