by Ben Schmidt
"Extracted ship's paths from the ICOADS database plotted in Cartesian space to reveal outlines of continents.
The
data was incomplete in all sorts of really fascinating ways before I
got to it; I've downsampled the two later periods so there are
approximately the same number of ship-days in each collection."
From Ben Schmidt's blog
Digitization makes the most traditional forms of humanistic scholarship
more necessary, not less.
But the differences mean that we need to
reinvent, not
reaffirm, the way that historians do history.
This month, I've posted several different essays about ship's logs.
These all grew out of a single post; so I want to wrap up the series
with an introduction to the full set.
The motivation for the series is
that a medium-sized data set like Maury's 19th century logs (with
'merely' millions of points) lets us think through in microcosm the
general problems of reading historical data.
So I want in this post to
walk through the various parts I've posted to date as a single essay in
how we can use digital data for historical analysis.
Ship
tracks in black, plotted on a white background, show the outlines of
the continents and the predominant tracks on the trade winds.
The central conclusion is this: To do humanistic readings of digital data, we cannot rely on
either traditional
humanistic competency or technical expertise from the sciences.
This
presents a challenge for the execution of research projects on digital
sources: research-center-driven models for digital humanistic resources,
which are not uncommon, presume that traditional humanists can bring
their interpretive skills to bear on sources presented by others.
We need to rejuvenate three traditional practices: first, a
source criticism that explains what's in the data; second, a
hermeneutics that lets us read data into a meaningful form; and third,
situated argumentation that ties the data in to live questions in their field.
Inverted Ship Paths
Historians tend to view that third part, argumentation,
as
the heart of their creative endeavor.
But the widespread availability
of digital sources calls that priority into question.
The outlines of
historical argument tend to be quite constrained.
Most anyone can slap
together an argument that (to take a fictional event) bureaucratic
continuity largely pre-determined
the course of Rufus T. Firefly's administration; that Bob Roland's
role in instigating the Sylvanian war has been underestimated; or that
Gloria Teasdale creatively exploited traditional expectations of
femininity to take on enormous power without explicitly challenging the
status quo.
The historian's real contribution is in assembling the
evidence to make those claims
convincingly, and knowing how to effectively read the sources so as not
to be misled by all their biases.
In the past,
historians held a safe monopoly over the first two stages that allowed
them to develop uncontested expertise.
But confronted with digital
sources, their hold is much more tenuous.
Facing a digital source base
with expertise primarily in close reading and navigating traditional
archives, we are--whether we admit it or not--largely disarmed.
A
historian whose access is mediated by an archivist tends to know how
best to interpret her sources; one plugging at databases through
dimly-understood methods has lost his claim to expertise.
Ship's logs can illustrate what it might mean to build this historical
expertise on a digital source base.
The sources I've been working with--
climatological records from the National Oceanic and Atmospheric Administration--are
obviously historically interesting and neglected.
In addition to the
Maury collection I've been examining, it contains extensive records of
the US Navy in World War II, the Japanese merchant marine over much of
the post-Meiji period, and millions of other records that show the
commercial and military interconnections of the world at sea.
Their
problem is that they are essentially intractable to more traditional
forms of historical analysis, while still significantly less complicated
than the massive textual collections in which I (like most humanists)
see the greatest potential for future research.
The first post offers that source criticism by means of a
genealogy of the shipping data that we have.
To use any sort of historical data, we
must above all understand the constraints under which it was collected.
In this case, that means
retelling
the history of why and how the ship's logs were first collected, and
how the constraints of digitization in the punch card era radically
shape the sort of evidence we can draw from them.
The important
thing about this sort of work is that it helps us understand the overall
biases of a particular data set, which is crucial for limiting our
interpretive leaps.
ICOADS voyages from the CLIWOC and US Maury decks,
plotted to show the outlines of the continents.
Voyages shown are between 1850 and 1960.
The Maury collection (and the full ICOADS set) presents a welter of
conflicting visions.
Humanists and scientists alike, trained in the
language of survey research, tend to ask of data sets: "Is it a
representative sample?"
I doubt there is a single dataset of interest to
historians that is.
But while attempting to normalize away the biases
in a sample is the best scientific solution to the problem, the
humanistic approach is to understand a source
through its biases without expecting it to yield definitive results.
While this is the central goal of digital source criticism, it can be
quite interesting in itself: that ship's records were digitized before
computers existed (or more precisely,
when computers were women)
ensures that we treat digitization not as the default fate of all
historical objects but as the result of peculiar institutional choices.
In histories based on large textual corpora, this requires trying to
understand the changing acquisition and cataloging patterns of dozens of
different libraries and their interactions with thousands of presses.
With ships, at least, there are relatively few organizations whose
imperatives we need to understand.
A hermeneutics of data is harder than understanding the biases of its
sources.
I take the position that the best way to 'read' data is through
visual representation
(at least, most forms of auditory representation
or narrative description are far less good at allowing cyclical
representation).
A first, basic step is simply understanding what's
contained in the data.
Basic visualizations of ships moving over time
(which I developed in an earlier post, not in this series)
allow strong insights into what's going on in the data, and are
generative of new questions.
One quickly notices in the seasonal plot of
shipping patterns, for instance, massive migrations north and south each
year that turn out to be whaling ships:
Paths taken by American ships from about 1800 to 1860,
running as if in a single year to show seasonal patterns.
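Collapsing many years into "a single year" amounts to discarding the year from each observation date and keeping only the day of the year. The following is a minimal sketch of that idea, not the code behind the plot: the `observations` records and the `seasonal_frames` helper are hypothetical, invented for illustration.

```python
from datetime import date

def day_of_year(d):
    """Return the 1-based day of the year, discarding the year itself."""
    return d.timetuple().tm_yday

# Hypothetical observations: (ship name, date, latitude).
observations = [
    ("ship A", date(1845, 1, 15), -38.0),
    ("ship B", date(1852, 7, 20), 52.0),
    ("ship C", date(1849, 1, 10), -41.5),
]

def seasonal_frames(obs):
    """Group observations by day of year so that all years play back as one."""
    frames = {}
    for ship, when, lat in obs:
        frames.setdefault(day_of_year(when), []).append((ship, when.year, lat))
    # Sort by day so the frames run from January to December.
    return dict(sorted(frames.items()))

for day, points in seasonal_frames(observations).items():
    print(day, points)
```

Each frame then holds every ship position recorded on that calendar day, whatever the year, which is what lets seasonal movement north and south stand out.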
To move from this observation to the study of the whaling ships in
particular requires integrating algorithmic techniques into the cycle of
visualization.
I
show how particular machine-learning algorithms can be used to extract
subsamples of interest from the dataset, as well as give a view of its
overall shape more interpretable than simply plotting all the tracks.
Again, this carries implications for textual research.
My preferred
method here, a two-level application of k-nearest-neighbor
classification based on a training set tagged by a number of origin
ports, only began to work after iteratively searching through several
techniques.
In this sense, visualization on a corpus of
geographical data is easy.
I was able to keep visualization rooted in a cycle of reading against maps where
individual ships could be pulled out and checked against the full data,
and where visual inspection could confirm algorithmic sorting to be 'working' or 'broken.'
That led me to
worry over the uncritical acceptance of topic modeling in textual research.
Topic models fail on whaler voyages in a way that would not be detected
by most 'normal' users of topic models, since visualization for model
fit is somewhere from elusive to impossible.
*Visualization to explore the
model is possible, but that's something quite different: traditionally,
visualization is used to find patterns on raw data, and to confirm the
fit of models. But topic models tend to be so complicated and
specialized that we need visualization simply to understand what they're
saying.
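For readers unfamiliar with the k-nearest-neighbor classification mentioned above, here is a rough sketch of what a two-level version might look like. It rests on an assumption about what "two-level" means: level one labels each logged position by majority vote among its nearest tagged positions, and level two labels a whole voyage by majority vote over its positions. The training points, port tags, and function names are all hypothetical, and this is not the method used in the series.

```python
import math
from collections import Counter

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def knn_label(point, training, k=3):
    """Level one (assumed): majority label among the k nearest tagged points."""
    nearest = sorted(training, key=lambda item: haversine_km(point, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def classify_voyage(track, training, k=3):
    """Level two (assumed): majority vote over the voyage's point labels."""
    votes = Counter(knn_label(p, training, k) for p in track)
    return votes.most_common(1)[0][0]

# Hypothetical training points: (lat, lon) positions tagged with an origin port.
TRAINING = [
    ((41.6, -70.9), "New Bedford"), ((40.0, -68.0), "New Bedford"),
    ((35.0, -60.0), "New Bedford"), ((-30.0, -40.0), "New Bedford"),
    ((40.7, -74.0), "New York"), ((45.0, -50.0), "New York"),
    ((50.0, -20.0), "New York"), ((53.4, -3.0), "New York"),
]

# Hypothetical untagged voyage heading south-east from New England.
track = [(41.0, -70.0), (36.0, -62.0), (-25.0, -38.0)]
print(classify_voyage(track, TRAINING))
```

With geographic data, a result like this can be checked by drawing the classified track on a map, which is the kind of visual confirmation that is much harder to get for a topic model.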
Technical competence, however, is insufficient.
A hermeneutics of data
also has to deal with the complications of working with these
statistical aggregates at all.
As I've argued before, digital history
needs theoretical justifications for its reading of aggregate sources.
I
tried to give that in my post on
the benefits of writing digital histories that avoid telling the stories of individuals.
"Whaling" and "ship's voyages" are not nearly as complicated aggregates
as the ones that digital history should really be investigating: things
that we can investigate with texts like gender, academic disciplines,
and generations.
But even there, there is a case to be made for stories
that conceive of aggregate systems rather than individual actors.
*At the same time, one of the benefits of
working in a digital medium is that there's a ready avenue to share
evidence that's outside.
For instance, some of the ships most important
to the plot of Moby-Dick happen to be in Maury's database: it's easy
enough to break those maps out and show the tracks of the Essex and the Acushnet.
Finally, in the central piece in the series, I try to apply that
hermeneutics and source criticism to argue how this interpretive
framework allows us to recenter our interpretations of the place of
shipping and extraction in the mid-19th century United States.
This
makes use of visualization again, but as a narrative technique rather
than the heuristic role it played in data selection.
Narrating
voyages through data visualization clarifies the unique role of the
whaling industry in American shipping: it is both the primary industrial
use of the sea (as opposed to commercial voyages that reach across it),
and a self-exhausting process of resource depredation that gives an
unlikely perspective on the movement patterns of early American
capitalism.
The progressive depletion of whaling grounds drives the
fleet farther and farther afield each year, expanding the reach of
American voyagers.
Whaling voyages from logbooks collected in the 19th century by Lt.
Matthew Maury: the tracks show all the places whale ships have been,
highlighting the exploitation of more and more remote whaling grounds
over the mid-19th century.
This geographical reach has enormous consequences; one can see how
and why the internal dynamics of whaling brought previously exotic lands
including Japan and Alaska into the American sphere, while placing the
Hawaiian islands at the center of a trans-Pacific network.
At the same
time, thinking of the operation of the whaling system makes clear just
how historically limited each of these interactions was: although the
Sea of Japan was one of the hotbeds of whaling in the summers of 1848
and 1849, it was almost entirely abandoned by the time Matthew Perry's
"Black Ships" arrived.
Industrial depletion operates by a clearly
evident logic, but one quite different from commercial interconnection.
As historians, driven by contemporary politics, try to analogize from
the well-trod grounds of the interlinked Atlantic World to a new Pacific
one, we could do worse than to recognize the differences as well as the
similarities.
~~~~~~~~
A couple due thanks: to
Dael Norwood,
through whom I've gotten most of my information about 19th century
shipping history and who was explaining the importance of Maury's
collecting practices before I knew there was Maury logbook data out
there.
And a plug for the
New Bedford Whaling Museum,
which does a great job curating the history of whaling through actual
material culture.
Which, I'll concede, has its advantages.
The
chronological sequence of increasingly inventive cruelty in the
"Harpoons and Whalecraft" exhibit in the Bourne Building is something to behold.