In October 2014, I presented “Data Mining on Vendor-Digitized Collections” at the Charleston Conference alongside Peter Leonard, Librarian for Digital Humanities Research, Yale University, and other colleagues (1). The Charleston Conference was the perfect setting in which to discuss collaboration between vendors and librarians on large-scale digital humanities work. After demonstrating some current data-mining projects at Yale and elsewhere to show the appeal of digital humanities across disciplines, we talked about issues of research platforms, file formats, acquisitions workflows, copyright and licensing agreements. Our session drew a full room of attendees who responded enthusiastically and asked questions from a wide range of perspectives.
What is data mining?
Data mining leverages computational methods for the analysis of large digital collections of texts and images. It is an umbrella term for an array of tools that enable us to go beyond the capabilities of keyword or full-text searches — for example, to quantitatively compare the language usage of male vs. female authors within a library of books, map the birthplaces of artists represented in a museum’s collection, or chart the appearance of Tiffany & Co. ads in Vogue magazine over time. Using image analysis techniques, we can even sort pictures by color or analyze variations of hue and saturation over time. Computer science techniques enable further analyses like face recognition or segmentation of the geometry of the page. Human scholars come up with the questions and interpret the results, but these tools let them zoom out on a body of material in ways that would have been prohibitively labor-intensive before.
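To make the color-analysis idea concrete, here is a minimal sketch in Python. The pixel values below are invented for illustration; a real project would extract them from digitized page images. Even the standard library can compute average hue and saturation:

```python
import colorsys

# Hypothetical RGB pixel samples from two "covers" (values invented for illustration).
covers = {
    "1920s cover": [(200, 30, 40), (180, 50, 60), (210, 20, 30)],   # reddish palette
    "1960s cover": [(40, 80, 200), (60, 90, 180), (30, 70, 210)],   # bluish palette
}

def average_hsv(pixels):
    """Average hue and saturation of a list of (R, G, B) tuples."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    n = len(hsv)
    return (sum(h for h, s, v in hsv) / n, sum(s for h, s, v in hsv) / n)

for name, pixels in covers.items():
    hue, sat = average_hsv(pixels)
    print(f"{name}: mean hue={hue:.2f}, mean saturation={sat:.2f}")
```

Charting such averages issue by issue is one way variations of hue and saturation over time could be tracked.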
What is the librarian’s role in data mining?
Access to large datasets comes to many researchers via the library — historical newspaper databases, digitized literary texts, census data, geospatial data, and on and on. The subject librarian’s role is to connect researchers to the information in their fields and to identify possibilities for tools and cross-campus projects. At Yale we are fortunate to have a librarian for digital humanities research, a Digital Humanities Laboratory in the works, and numerous faculty members and graduate students interested in pursuing this kind of work themselves and in bringing it to their students. These scholars see the library as a starting point for this kind of research.
Robots Reading Vogue
A particularly exciting project I am involved in at Yale Library, one of many underway here, applies data mining techniques to a well-marked-up corpus of data: the ProQuest Vogue Archive. Peter Leonard and I call this collection of data mining experiments Robots Reading Vogue, and we have used it to demonstrate the research opportunities a large and robust collection of digital data can provide to researchers.
The Vogue Archive includes an image of every page of every issue of American Vogue back to 1892, with XML markup of full text, advertisers, photographers, editors, etc. When a user is presented with the opportunity to search or browse through such a vast archive, how does he or she even begin to know what to look for?
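The exact schema belongs to ProQuest; purely as a hypothetical sketch of what working with such markup looks like, a simplified article record can be pulled apart with Python’s standard XML tools (the element names below are invented, not the real schema):

```python
import xml.etree.ElementTree as ET

# Invented, simplified record -- the real ProQuest markup differs.
record_xml = """
<record>
  <issue year="1950" month="9"/>
  <doctype>advertisement</doctype>
  <advertiser>Tiffany &amp; Co.</advertiser>
  <contributor role="photographer">A. Example</contributor>
  <fulltext>Diamonds for the autumn season...</fulltext>
</record>
"""

record = ET.fromstring(record_xml)
print(record.findtext("doctype"))        # advertisement
print(record.findtext("advertiser"))     # Tiffany & Co.
print(record.find("issue").get("year"))  # 1950
```

Markup like this — document type, advertiser, contributors, issue date — is what makes the data-mining experiments below possible at all.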
Our n-gram search tool allows users to chart the usage of individual words and phrases across Vogue’s 122 years of publication. It uses Bookworm, the open-source, “bring your own books” version of the Google Books Ngram Search developed by Google and the Harvard Cultural Observatory. The n-gram tool defaults to sample searches of terms rising and falling in Vogue over the years. The search boxes can be easily adjusted for different queries, such as comparing the frequency of the same word in advertisements versus articles.
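The core computation behind an n-gram chart is simple: count a term’s occurrences per year and normalize by that year’s total word count. A minimal sketch, using an invented toy corpus in place of the archive’s full text:

```python
# Toy year-by-year text samples (invented); a real corpus would hold full page text.
pages_by_year = {
    1925: "the flapper dress and the cloche hat define the season",
    1955: "the new look dress and gloves for every occasion",
    1985: "power suit shoulder pads and bold color",
}

def relative_frequency(term, pages_by_year):
    """Per-year frequency of `term`, normalized by that year's word count."""
    series = {}
    for year, text in pages_by_year.items():
        words = text.lower().split()
        series[year] = words.count(term) / len(words)
    return series

print(relative_frequency("dress", pages_by_year))
```

Bookworm does this at scale, with tokenization and indexing suited to millions of pages, but the normalization step is what makes frequencies comparable across years of very different sizes.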
Another way of digging into large datasets is letting the data organize itself, using a technique called topic modeling borrowed from computer science. Even though the algorithm has no knowledge of English or of particular subjects or historical contexts, it groups words into clusters that statistically tend to appear in proximity to one another. Those clusters form recognizable topics like “art” or “advice and etiquette.” We can then track those topics over Vogue’s history to see what was being written about, and when.
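The clustering can be reproduced at toy scale with a bare-bones latent Dirichlet allocation sampler (collapsed Gibbs sampling); real projects use mature libraries such as MALLET, and the documents below are invented to show two themes separating:

```python
import random
from collections import defaultdict

random.seed(0)

# Invented toy documents: an "art" theme and an "etiquette" theme that never mix.
docs = [
    "paint gallery sculpture paint canvas".split(),
    "gallery canvas sculpture paint gallery".split(),
    "etiquette manners invitation etiquette guest".split(),
    "guest manners invitation etiquette manners".split(),
] * 3  # repeat to give the sampler more evidence

K, ALPHA, BETA = 2, 0.1, 0.01
V = len({w for d in docs for w in d})

# Count tables for collapsed Gibbs sampling.
n_dk = [[0] * K for _ in docs]               # topic counts per document
n_kw = [defaultdict(int) for _ in range(K)]  # word counts per topic
n_k = [0] * K                                # total tokens per topic
z = []                                       # topic assignment per token

for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        k = random.randrange(K)
        z[d].append(k)
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for _ in range(200):                         # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][t] + ALPHA) * (n_kw[t][w] + BETA) / (n_k[t] + V * BETA)
                       for t in range(K)]
            k = random.choices(range(K), weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for t in range(K):
    top = sorted(n_kw[t], key=n_kw[t].get, reverse=True)[:3]
    print(f"topic {t}: {top}")
```

The sampler is never told what “art” or “etiquette” means; it only exploits which words co-occur, which is exactly the statistical behavior described above.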
The tool that facilitates comparisons of advertisements relies on the metadata in the archive to count, average, normalize and sort. It can help answer questions like: Which tobacco advertiser placed the most ads, and when? Which automobile company first advertised in Vogue?
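A minimal sketch of that count-normalize-sort pipeline, using invented ad-level records (the real archive’s fields differ):

```python
from collections import Counter

# Invented ad metadata records; field names are hypothetical.
ads = [
    {"year": 1930, "advertiser": "Acme Tobacco"},
    {"year": 1930, "advertiser": "Acme Tobacco"},
    {"year": 1930, "advertiser": "Zenith Motors"},
    {"year": 1950, "advertiser": "Acme Tobacco"},
    {"year": 1950, "advertiser": "Zenith Motors"},
    {"year": 1950, "advertiser": "Zenith Motors"},
]

# Count ads per (advertiser, year), then normalize by that year's total.
per_year = Counter(ad["year"] for ad in ads)
counts = Counter((ad["advertiser"], ad["year"]) for ad in ads)
share = {key: n / per_year[key[1]] for key, n in counts.items()}

# Sort to surface the heaviest advertiser within each year.
for (advertiser, year), s in sorted(share.items(), key=lambda kv: (kv[0][1], -kv[1])):
    print(f"{year}: {advertiser} placed {s:.0%} of ads")

# First year each advertiser appears answers the "who advertised first" question.
first_seen = {}
for ad in sorted(ads, key=lambda a: a["year"]):
    first_seen.setdefault(ad["advertiser"], ad["year"])
print(first_seen)
```

Normalizing by yearly totals matters because raw counts would simply track how many ads the magazine carried in a given era, not an advertiser’s relative presence.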
In creating these and other experiments mining the Vogue Archive, we were not interested in replicating the search features and presentation of the ProQuest interface. We also do not make copyrighted material publicly available. In fact, our expectation is that use of Bookworm and other data-mining tools will actually drive new traffic to digitized archives like the ProQuest Vogue Archive. The September 2012 issue of Vogue is old fashion news; the aggregated 122-year archive available for digital investigation sparks interest from computer scientists and gender studies professors alike.
Collaboration in the service of research
This type of work represents new opportunities for collaboration between vendors and librarians to serve the research needs of faculty and students. At the moment, many people’s work is needed to securely transfer files of licensed content from vendors to libraries and steward their use by researchers. As digital humanities methods become more widely used, systems to facilitate this transfer of data will have to scale up to meet demand. Vendors are recognizing the need for scalable approaches to making raw data readily available for researchers at libraries that subscribe to their products. Two examples are JSTOR Data for Research and Gale Digital Collections from Gale/Cengage Learning.
Robots Reading Vogue would not be possible without the systematic digitization and metadata markup undertaken by ProQuest; recreating that work would require a prohibitively expensive outlay of resources for any library. In short, this is work that we in the library world are glad to have done by vendors. We hope that our discussions, grounded in real-world data-mining applications, have helped outline requirements and specifications for these types of database products going forward.
Digital humanities approaches have been shifting from the margins to the mainstream of academic work in recent years — not replacing, but augmenting research methods for scholars in literature, history, art history and other fields. This digital shift is necessitating revisions in standards for publication, promotion and tenure, as well as expectations of technological proficiency. Students and faculty at Yale have been excited for years about having access to great historical resources like the Vogue Archive, but it’s even more exciting to think about them leveraging data-mining tools to ask questions none of us could imagine answering before.
1. Daniel Dollar (Director of Collection Development, Yale), Peg Knight (Senior Product Manager for the Arts, ProQuest), and Niels Dam (Vice President for Product Management, ProQuest)