Charting a Digital Wilderness
New Social Science Data Service helps students and researchers navigate
massive and complex bodies of data
By Ken Gewertz
Gazette Staff
Joseph Geraci '98 had taken on an ambitious senior thesis. An economics
concentrator, he wanted to determine the impact of crime on New York City
housing values over time.
The project was difficult because there are many other factors besides
crime that might affect the value of a house. A toxic waste dump next door
might lower its value. Good schools in the neighborhood might raise it.
In order to isolate the relationship between crime and housing prices,
Geraci had to control for these other factors, and that was where things
got complicated.
His main difficulty was matching one set of data to another. The boundaries
of a police precinct, for example, do not necessarily match those of a census
tract or a school district. It didn't make it easier that some data was
in electronic form while some existed only on paper. And of those data sets
that had been digitized, programming variations often made it extremely
difficult to access the data or to compare one set of numbers with another.
Eventually, Geraci solved these problems. Working with his thesis adviser,
Economics Professor Edward Glaeser, Geraci slogged resolutely through a
largely uncharted digital wilderness, producing a thesis that was nominated
for the Hoopes Prize. Now a researcher at Salomon Smith Barney, Geraci looks
back on his thesis experience with an appreciation of the technological
capabilities available to him as a Harvard undergraduate.
"My thesis would have been simply impossible for an undergraduate
10 years ago. Undergraduates now have the computing power to study quantitative
issues and write empirical theses that 10 years ago could only have been
done by professors with mainframes and huge resources," Geraci says.
Among Geraci's allies in his struggle to hunt down and subdue huge, untamed
data sets was a network of librarians with expertise in the rapidly changing
field of social science data. Even as Geraci was writing his thesis and
during the months since his graduation, the library has been reorganizing
itself to better meet the needs of students who are working with increasingly
massive and complex bodies of data.
"We've been thinking about this for a number of years, how to use
our resources most efficiently," says Diane Garner, librarian for the
social sciences of Harvard College Libraries, who oversees the growth and
development of these new research aids.
"In our every-tub-on-its-own-bottom environment, libraries have
been accustomed to acquiring their own copies of the resources needed by
their clientele. With electronic databases this doesn't make sense. Many
of them are quite expensive, and they can be shared and distributed across
networks to wherever the users are, with the result that duplication is
not only a waste of money, it is also less convenient for the users. But
many electronic databases present challenges to users. Unlike books, you
can't just take them off the shelf and read them. We are creating within
the social sciences program a core of staff to whom users can turn when
they need help. We are trying to get the best of both worlds--centralization
and autonomy," Garner says.
In January 1998, four units within the Harvard College Library were linked
together to form the Social Science Program, which has the aim of enhancing
and coordinating social science collections and services. These units are
Littauer Library, the Environmental Resources Service, the Harvard Map Collection,
and Government Documents and Microforms.
The cornerstone of this new program is the Social Science Data Service,
which helps researchers like Geraci navigate within these resources. The
Data Service also collaborates closely with the Harvard-MIT Data Center
(HMDC) to make more data more easily available across the University.
In July, Heather McMullen came on board as the social science data librarian.
She joins librarians John Collins and John Baldisserotto as part of the
team of specialists who guide students and other researchers through the
vast universe of electronic data available at Harvard and through affiliated
sources.
The sheer quantity of the material is deceptive, due to the form in which
it is stored. True, much of the data is still paper-based, while another
large division resides on microfilm, but the proportions are rapidly changing,
and it is clear to those who order, catalog, and work with this material
that the future belongs to digital formats, the most prevalent being, at
present, the CD-ROM.
Many of these materials are kept in the Government Documents room in
Lamont Library. A cool, quiet, below-ground space lit by a row of high windows,
its banks of file cabinets contain row upon row of CD-ROMs from the departments
of Education, Energy, Defense, Treasury, Transportation, the Environmental
Protection Agency, the U.S. Patent Office, the Census Bureau, and numerous
others.
The U.S. Government is the largest gatherer of statistics in the world,
and, because Harvard is an official depository of government data, much
of it is available here and is open to all users. The library is free to
use its discretion in choosing which data to accept on deposit, but with
the scholarly demand for quantitative information being what it is, library
officials rarely respond in the negative.
"Government agencies ask us if we want specific documents,"
Garner says. "But generally, if it's data, we want it."
Government Documents received about 1,000 CD-ROMs last year, in addition
to large quantities of data on paper and microfiche.
But the holdings of U.S. government deposit resources are only a fraction
of the library's and Harvard's social science data. The library also buys
data from commercial providers, including international statistics on economics,
health, trade, crime, education, politics, and other topics. Harvard users
also have access through the HMDC's Website to a huge repository of social
science data from the Inter-University Consortium for Political and Social
Research and the Roper Center.
In some cases, leading-edge technologies have allowed the library to
acquire astoundingly complete bodies of data. A new set of CD-ROMs from
Britain contains extensive information from the British Public Record Office,
one of the oldest institutions of its type in the world, which has data
going back to the Middle Ages. The discs contain digitized photographs of
original documents that allow users to zoom in to examine written records
in microscopic detail as well as to search the documents by key word. If
it were in paper form, the data would fill a large room.
In other cases, the library's acquisitions are driven by the needs of
enterprising researchers. Recently, a graduate student writing a dissertation
on international economics needed census data from Zambia. Through connections
in the Harvard Institute for International Development (HIID), he hand-delivered
a Zip drive and a stack of Zip disks to the Zambian central statistical
office, where workers downloaded the entire 1991 census onto disks. The
student brought the disks back to Cambridge where the HMDC burned the data
onto CD-ROMs.
"As far as I know, we're the only place in the world outside of
Zambia to have the complete census in this form," Garner says.
Although CD-ROMs provide an incredibly compact and convenient means of
storing data, the library is already pushing ahead toward even more accessible
formats. One plan, on the verge of being implemented, is to set up a computer
server that will contain the most frequently used data sets from the library's
collection of CD-ROMs. Once these are mounted on the server, they will be
instantly accessible from special library workstations. The Social Science
Program is also working with the Harvard-MIT Data Center to mount data on
the HMDC Website (http://data.fas.harvard.edu/), which has an easy-to-use
interface and is available across the University.
In addition to its rapidly growing collection of CD-ROMs, the library
also provides access to significant Websites via HOLLIS Plus. One of the
most useful is the Central Intelligence Agency's World News Connection,
which posts selected, translated articles from foreign news media.
Another unique service involves the library's map collection, which has
been undergoing a parallel development toward digitization. These changes
have allowed researchers to combine statistical data with boundary files in
mapping software such as geographic information systems (GIS) to produce
original maps. In many cases, these visual representations have led to further
analytic insights.
"We see a new role for libraries, not just as passive repositories,
but as active purveyors of information. We're always suggesting options
for what students can do," Garner says.
Pausing, she gazes around the cubicle-filled space in the basement of
Lamont in which she and the other librarians work to provide access to statistical
data.
"I'm sometimes in awe of what we have here," she says. "I
want the world to know about it."
Copyright 1998 President and Fellows of Harvard College