The authors of this article have attended Data Scientist Training for Librarians: Jeremy Guillette — course 1 and course 2 (TA), and James Damon — course 2. Visit the DST4L blog and participate on Twitter using #DST4L.
A course called Data Scientist Training for Librarians (DST4L), held at Harvard University for librarians in the Boston area, has led to a community of librarians who work on data-driven projects.
During the first DST4L offering in the spring of 2013, students went from data extraction and cleanup to forging links between data items by API queries and, finally, creating visualizations and presenting a coherent narrative for a data-driven story. Students were divided into teams and each team took on a data-driven project that they worked on for the duration of the course. A series of instructors introduced tools such as Python, Git, OpenRefine, R, and Tableau during weekday lessons, and the students put them to use on weekends when they met to work on projects.
The second offering of DST4L, held in the fall of 2013, started with a two-day boot camp, taught by Software Carpentry, for a new cohort of data-savvy librarians in training. Subsequently, students met for three hours each Tuesday morning for 14 weeks. In addition to covering Python, Git, and OpenRefine, the course covered some of the finer points of Excel and introduced Gephi, a network analysis and visualization tool. Students then developed their own projects and collaborated during a two-day hackathon. The course concluded with presentations and a panel discussion led by participants.
DST4L in Context: Russian Gazetteer
(Note: In this section, “we” refers to the project participants: Jeremy Guillette, an LIS student at Simmons Graduate School of Library and Information Science, and Hugh Truslow, Fung Head Librarian and DST4L alumnus.)
One project to come out of DST4L is the construction of a Russian gazetteer at Harvard’s Fung Library. (A gazetteer is a geographical dictionary used in conjunction with a map or atlas.) In this project, we took a document that had already been digitized by HathiTrust and put it through an OCR program to extract the text with formatting intact. With the text and formatting in a machine-readable format, we could extract particular parts of the text: in this case, the names, types, and administrative units of the places it mentions. We then sent the names to a geolocation API, specifying the approximate areas based on our knowledge of the administrative areas. However, the late-1700s text uses letters and spellings that are not part of modern Russian, so some transformations had to be applied before the names could be used. Fortunately, the project is structured so that the tools to locate places in this text can be applied to other Russian texts and, with a bit of modification, to texts and location data in any language.
The most difficult part of the project, and the part where the DST4L tools, skills and ways of thinking have been most helpful, is reconciling older place names with current ones. The text uses archaic spellings, as well as letters that went out of use when Russian spelling was reformed in 1917-1918. By writing simple rules in Python and applying them to the place names before asking an API for their locations, we can automatically replace those archaic forms with modern equivalents.
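To make the idea concrete, here is a minimal sketch of what such rules might look like in Python. It is not the project's actual code: the function name `modernize` and the choice of rules are assumptions for illustration, and they cover only the four letters dropped by the spelling reform plus the hard sign at the end of a word. The letters are written as Unicode escapes because several of them (the Cyrillic і, for instance) are easy to confuse with Latin look-alikes.

```python
import re

# Letters removed by the reform, mapped to their usual modern equivalents.
# Assumption: a one-to-one mapping is good enough for place names.
LETTER_MAP = str.maketrans({
    "\u0463": "\u0435",  # yat (ѣ) -> е
    "\u0462": "\u0415",  # capital yat -> Е
    "\u0456": "\u0438",  # decimal i (і) -> и
    "\u0406": "\u0418",  # capital decimal i -> И
    "\u0473": "\u0444",  # fita (ѳ) -> ф
    "\u0472": "\u0424",  # capital fita -> Ф
    "\u0475": "\u0438",  # izhitsa (ѵ) -> и (a simplification)
    "\u0474": "\u0418",  # capital izhitsa -> И
})

# A hard sign (ъ) at the end of a word was dropped by the reform.
# Inside a word (as in "подъезд") it is still used, so \b keeps it there.
FINAL_HARD_SIGN = re.compile(r"[\u044a\u042a]\b")


def modernize(name: str) -> str:
    """Rewrite a pre-reform spelling with modern letters."""
    name = name.translate(LETTER_MAP)
    return FINAL_HARD_SIGN.sub("", name)
```

With these rules, the pre-reform spelling Кіевъ becomes Киев. A real text needs more than this (archaic adjective endings, for example), but each new rule can be added to the same function and applied to every name in one pass.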
Sometimes, the API doesn’t return results for a place name or it returns multiple possible matches in an area (meaning we didn’t find an exact match). By consistently applying increasingly sophisticated transformations to the text, we aim to automate as much of the process as possible before doing research by hand. When we have every place mentioned in the text georeferenced, we will put the results into an existing interface at the Center for Geographic Analysis so that others can build them into their own workflows.
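The escalation described above can be sketched as a simple loop: try the name as printed, then try it again after each successive transformation, and stop at the first exact match. Everything named here is hypothetical. The article does not say which geolocation API the project uses, so the geocoder is passed in as a function, and `fake_geocode`, `FAKE_INDEX`, and the `place-…` identifiers are toy stand-ins rather than real data.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical signature: takes a place name, returns candidate matches.
Geocoder = Callable[[str], List[str]]
Transform = Callable[[str], str]


def resolve(
    name: str,
    geocode: Geocoder,
    transforms: Sequence[Transform],
) -> Optional[Tuple[str, str]]:
    """Try the name as printed, then after each successive transformation.

    Returns (spelling_that_matched, match) on the first exact (single) match,
    or None when every attempt yields zero or several candidates.
    """
    candidate = name
    attempts = [lambda s: s, *transforms]  # identity first: try the raw name
    for step in attempts:
        candidate = step(candidate)  # transformations accumulate
        matches = geocode(candidate)
        if len(matches) == 1:
            return candidate, matches[0]
    return None  # left for research by hand


# Toy stand-ins for demonstration only; the identifiers are placeholders.
FAKE_INDEX = {"Киев": ["place-001"], "Орел": ["place-002", "place-003"]}


def fake_geocode(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


def drop_final_hard_sign(s: str) -> str:
    return s[:-1] if s.endswith("ъ") else s


found = resolve("Киевъ", fake_geocode, [drop_final_hard_sign])
unresolved = resolve("Орелъ", fake_geocode, [drop_final_hard_sign])
```

In this toy run, `found` holds the modernized spelling together with its single match, while `unresolved` is `None` because the index offers two candidates for that name. Names that come back as `None` are the ones that would go on the list for manual research.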
Why take data science for a spin?
One of the great things about DST4L is its hands-on approach to data-driven projects. Even for librarians who don’t intend to work directly on these kinds of projects, the class engenders a different mindset. After working directly with messy data, oddly formatted websites that are resistant to scraping, unavailable or difficult-to-access data, or any of the dozens of other headaches that can plague researchers, you start to think about things a bit differently. You see the benefits of opening up access to your own data, you notice potential connections between datasets, and perhaps you get an idea of what everyone’s talking about when they say “big data.”
The specific lessons, tools and projects of DST4L provide participants with valuable skills and knowledge. However, DST4L’s greatest benefit may be the sense of continued growth and development in our profession. We didn’t become librarians because we lack the creativity, passion, talent or ambition to pursue our own interests. We didn’t replace our goals with those of our patrons. The success of a service mission is measured by the quality of the service provided. Ultimately, the quality of a librarian’s work is determined by his or her ability to understand and manage connections.
At face value, DST4L is a technology course that teaches librarians to script, extract, wrangle and visualize. At its heart, it reminds librarians to push themselves, each other and the profession. In searching for relationships between sets of data and building the ways to visualize them, we can make connections. Linked data becomes an open door. The potential of the Web becomes that much more visible. In class, we even got to know our neighbors. There is certainly no substitute for the colleagues you meet and the community you create in pursuit of knowledge.