Tuesday, January 7, 2014

Reading digital sources: a case study in ship's logs

World sailing routes including CLIWOC database (1750-2000)
by Ben Schmidt
"Extracted ship's paths from the ICOADS database plotted in Cartesian space to reveal outlines of continents.
The data was incomplete in all sorts of really fascinating ways before I got to it; I've downsampled the two later periods so there are approximately the same number of ship-days in each collection."
From Ben Schmidt blog

Digitization makes the most traditional forms of humanistic scholarship more necessary, not less.
But the differences mean that we need to reinvent, not reaffirm, the way that historians do history.

This month, I've posted several different essays about ship's logs.
These all grew out of a single post; so I want to wrap up the series with an introduction to the full set.
The motivation for the series is that a medium-sized data set like Maury's 19th century logs (with 'merely' millions of points) lets us think through in microcosm the general problems of reading historical data.
So I want in this post to walk through the various parts I've posted to date as a single essay in how we can use digital data for historical analysis.
All voyages from the ICOADS US Maury collection.
Ships tracks in black, plotted on a white background, show the outlines of the continents and the predominant tracks on the trade winds.

The central conclusion is this: To do humanistic readings of digital data, we cannot rely on either traditional humanistic competency or technical expertise from the sciences.
This presents a challenge for the execution of research projects on digital sources: research-center driven models for digital humanistic resources, which are not uncommon, presume that traditional humanists can bring their interpretive skills to bear on sources presented by others.

We need to rejuvenate three traditional practices: first, a source criticism that explains what's in the data; second, a hermeneutics that lets us read data into a meaningful form; and third, situated argumentation that ties the data in to live questions in the field.

Inverted Ship Paths

Historians tend to view that third part, argumentation, as the heart of their creative endeavor.
But the widespread availability of digital sources calls that priority into question.
The outlines of historical argument tend to be quite constrained.
Most anyone can slap together an argument that (to take a fictional event) bureaucratic continuity largely pre-determined the course of Rufus T. Firefly's administration; that Bob Roland's role in instigating the Sylvanian war has been underestimated; or that Gloria Teasdale creatively exploited traditional expectations of femininity to take on enormous power without explicitly challenging the status quo.
The historian's real contribution is in assembling the evidence to make those claims convincingly, and knowing how to effectively read the sources so as not to be misled by all their biases.

In the past, historians held a safe monopoly over the first two stages that allowed them to develop uncontested expertise.
But confronted with digital sources, their hold is much more tenuous.
Facing a digital source base with expertise primarily in close reading and navigating traditional archives, we are--whether we admit it or not--largely disarmed.
A historian whose access is mediated by an archivist tends to know how best to interpret her sources; one plugging at databases through dimly-understood methods has lost his claim to expertise.

Ship's logs can illustrate what it might mean to build this historical expertise on a digital source base.
The sources I've been working with--climatological records from the National Oceanic and Atmospheric Administration--are obviously historically interesting and neglected.
In addition to the Maury collection I've been examining, they contain extensive records of the US Navy in World War II, the Japanese merchant marine over much of the post-Meiji period, and millions of other records that show the commercial and military interconnections of the world at sea.
Their problem is that they are essentially intractable to more traditional forms of historical analysis, while still significantly less complicated than the massive textual collections in which I (like most humanists) see the greatest potential for future research.

The first post offers that source criticism by means of a genealogy of the shipping data that we have.
To use any sort of historical data, we must above all understand the constraints under which it was collected.
In this case, that means retelling the history of why and how the ship's logs were first collected, and how the constraints of digitization in the punch card era radically shape the sort of evidence we can draw from them.
The important thing about this sort of work is that it helps us understand the overall biases of a particular data set, which is crucial for limiting our interpretive leaps.

ICOADS voyages from the CLIWOC and US Maury decks,
plotted to show the outlines of the continents.
Voyages shown are between 1850 and 1960.

The Maury collection (and the full ICOADS set) presents a welter of conflicting visions.
Humanists and scientists alike, trained in the language of survey research, tend to ask of data sets: "Is it a representative sample?"
I doubt there is a single dataset of interest to historians that is.
But while attempting to normalize away the biases in a sample is the best scientific solution to the problem, the humanistic approach is to understand a source through its biases without expecting it to yield definitive results.

While this is the central goal of digital source criticism, it can be quite interesting in itself: that ship's records were digitized before computers existed (or more precisely, when computers were women) ensures that we treat digitization not as the default fate of all historical objects but as the result of peculiar institutional choices.
In histories based on large textual corpora, this requires trying to understand the changing acquisition and cataloging patterns of dozens of different libraries and their interactions with thousands of presses.
With ships, at least, there are relatively few organizations whose imperatives we need to understand.

A hermeneutics of data is harder than understanding the biases of its sources.
I take the position that the best way to 'read' data is through visual representation.
(At least, most forms of auditory representation or narrative description are far less good at allowing cyclical representation.)
A first, basic step is simply understanding what's contained in the data.
Basic visualizations of ships moving over time (which I developed in an earlier post, not in this series) allow strong insights into what's going on in the data, and are generative of new questions.
One quickly notices in the seasonal plot of whaling patterns, for instance, massive migrations north and south each year that turn out to be whaling ships:

Paths taken by American ships from about 1800 to 1860,
running as if in a single year to show seasonal patterns.
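
The idea of running many years of tracks "as if in a single year" can be sketched in a few lines. This is only an illustration under stated assumptions, not the code behind the plot above: the record layout (date, longitude, latitude) and the name fold_to_single_year are hypothetical.

```python
from datetime import date

def fold_to_single_year(observations):
    """Group (date, lon, lat) records by day of year, ignoring the year.

    Each key is a day number (1-366); each value is the list of positions
    recorded on that calendar day in any year. Note that leap years shift
    dates after February by one day number.
    """
    frames = {}
    for obs_date, lon, lat in observations:
        day = obs_date.timetuple().tm_yday
        frames.setdefault(day, []).append((lon, lat))
    return frames

# Hypothetical log entries: two on 1 January in different years, one on 4 July.
observations = [
    (date(1835, 1, 1), -70.0, 40.0),
    (date(1849, 1, 1), -150.0, 20.0),
    (date(1850, 7, 4), -160.0, 55.0),
]
frames = fold_to_single_year(observations)
```

Drawing one frame per day number, in order, would give an animation in which decades of voyages move together through a single composite year, which is what makes seasonal migrations visible.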

To move from this observation to the study of the whaling ships in particular requires integrating algorithmic techniques into the cycle of visualization.
I show how particular machine-learning algorithms can be used to extract subsamples of interest from the dataset, as well as give a view of its overall shape more interpretable than simply plotting them.

Again, this carries implications for textual research.
My preferred method here, a two-level application of K-nearest neighbor classification based on a training set tagged by a number of origin ports, only began to work after an iterative search through several techniques.
In this sense, visualization on a corpus of geographical data is easy.
I was able to keep visualization rooted in a cycle of reading against maps where individual ships could be pulled out and checked against the full data, and where visual inspection could confirm algorithmic sorting to be 'working' or 'broken.'
That led me to worry over the uncritical acceptance of topic modeling in textual research.
Topic models fail on whaler voyages in a way that would not be detected by most 'normal' users of topic models, since visualization for model fit is somewhere from elusive to impossible.

*Visualization to explore the model is possible, but that's something quite different: traditionally, visualization is used to find patterns on raw data, and to confirm the fit of models. But topic models tend to be so complicated and specialized that we need visualization simply to understand what they're saying.
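
For readers unfamiliar with K-nearest neighbor classification, here is a minimal sketch of what a two-level application might look like. The post does not spell out what the two levels were, so the reading used here (label each logged position first, then let a voyage's positions vote) is an assumption, as are the function names, the port labels and the coordinates.

```python
from collections import Counter
from math import hypot

def knn_label(point, labeled_points, k=3):
    """Level one: label a (lon, lat) point by majority vote of its k nearest
    labeled points. Plain Euclidean distance on degrees is a simplification:
    it ignores the earth's curvature and the wrap-around at 180 degrees."""
    nearest = sorted(
        labeled_points,
        key=lambda lp: hypot(lp[0][0] - point[0], lp[0][1] - point[1]),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def classify_voyage(track, labeled_points, k=3):
    """Level two (assumed): label every point on a track, then assign the
    voyage the most common point label."""
    point_labels = [knn_label(p, labeled_points, k) for p in track]
    return Counter(point_labels).most_common(1)[0][0]

# Hypothetical training positions, tagged by the origin port of their voyage.
training = [
    ((-150.0, 55.0), "port_A"), ((-160.0, 20.0), "port_A"), ((-170.0, 50.0), "port_A"),
    ((-40.0, 45.0), "port_B"), ((-30.0, 50.0), "port_B"), ((-20.0, 48.0), "port_B"),
]

pacific_track = [(-155.0, 40.0), (-165.0, 30.0), (-145.0, 50.0)]
atlantic_track = [(-35.0, 47.0), (-25.0, 49.0), (-45.0, 44.0)]
mixed_track = [(-155.0, 40.0), (-165.0, 30.0), (-35.0, 47.0)]

pacific_label = classify_voyage(pacific_track, training)
atlantic_label = classify_voyage(atlantic_track, training)
mixed_label = classify_voyage(mixed_track, training)
```

The appeal of a sketch like this is the one the paragraph above describes: every labeled voyage can be drawn back onto a map and checked by eye, so a classifier that is 'broken' shows up immediately.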

Technical competence, however, is insufficient.
A hermeneutics of data also has to deal with the complications of working with these statistical aggregates at all.
As I've argued before, digital history needs theoretical justifications for its reading of aggregate sources.
I tried to give that in my post on the benefits of writing digital histories that avoid telling the stories of individuals.
"Whaling" and "ship's voyages" are not nearly as complicated aggregates as the ones that digital history should really be investigating: things that we can investigate with texts like gender, academic disciplines, and generations.
But even there, there is a case to be made for stories that conceive of aggregate systems rather than individual actors.

*At the same time, one of the benefits of working in a digital medium is that there's a ready avenue to share evidence that's outside.
For instance, some of the ships most important to the plot of Moby-Dick happen to be in Maury's database: it's easy enough to break those maps out and show the tracks of the Essex and the Acushnet.

Finally, in the central piece in the series, I try to apply that hermeneutics and source criticism to argue how this interpretive framework allows us to recenter our interpretations of the place of shipping and extraction in the mid-19th century United States.
This makes use of visualization again, but as a narrative technique rather than the heuristic role it played in data selection.
Narrating voyages through data visualization clarifies the unique role of the whaling industry in American shipping: it is both the primary industrial use of the sea (as opposed to commercial voyages that reach across it), and a self-exhausting process of resource depredation that gives an unlikely perspective on the movement patterns of early American capitalism.
The progressive depletion of whaling grounds drives the fleet farther and farther afield each year, expanding the reach of American voyagers.

Whaling voyages from logbooks collected in the 19th century by Lt. Matthew Maury: the tracks show all the places whale ships have been, highlighting the exploitation of more and more remote whaling grounds over the mid-19th century.

This geographical reach has enormous consequences; one can see how and why the internal dynamics of whaling brought previously exotic lands including Japan and Alaska into the American sphere, while placing the Hawaiian islands at the center of a trans-Pacific network.
At the same time, thinking of the operation of the whaling system makes clear just how historically limited each of these interactions was: although the Sea of Japan was one of the hotbeds of whaling in the summers of 1848 and 1849, it was almost entirely abandoned by the time Matthew Perry's "Black Ships" arrived.
Industrial depletion operates by a clearly evident logic, but one quite different from commercial interconnection.
As historians, driven by contemporary politics, try to analogize from the well-trod grounds of the interlinked Atlantic World to a new Pacific one, we could do worse than to recognize the differences as well as the similarities.

A couple due thanks: to Dael Norwood, through whom I've gotten most of my information about 19th century shipping history and who was explaining the importance of Maury's collecting practices before I knew there was Maury logbook data out there.

And a plug for the New Bedford Whaling Museum, which does a great job curating the history of whaling through actual material culture.
Which, I'll concede, has its advantages.
The chronological sequence of increasingly inventive cruelty in the "Harpoons and Whalecraft" exhibit in the Bourne Building is something to behold.
