October 29, 1998
Harvard University Gazette

 


 

Charting a Digital Wilderness

New Social Science Data Service helps students and researchers navigate massive and complex bodies of data

By Ken Gewertz

Gazette Staff

Joseph Geraci '98 had taken on an ambitious senior thesis. An economics concentrator, he wanted to determine the impact of crime on New York City housing values over time.

The project was difficult because there are many other factors besides crime that might affect the value of a house. A toxic waste dump next door might lower its value. Good schools in the neighborhood might raise it.

In order to isolate the relationship between crime and housing prices, Geraci had to control for these other factors, and that was where things got complicated.

His main difficulty was matching one set of data to another. The boundaries of a police precinct, for example, do not necessarily match those of a census tract or a school district. It did not help that some data was in electronic form while some existed only on paper. And among the data sets that were in electronic form, programming variations often made it extremely difficult to access the data or to compare one set of numbers with another.

Eventually, Geraci solved these problems. Working with his thesis adviser, Economics Professor Edward Glaeser, Geraci slogged resolutely through a largely uncharted digital wilderness, producing a thesis that was nominated for the Hoopes Prize. Now a researcher at Salomon Smith Barney, Geraci looks back on his thesis experience with an appreciation of the technological capabilities available to him as a Harvard undergraduate.

"My thesis would have been simply impossible for an undergraduate 10 years ago. Undergraduates now have the computing power to study quantitative issues and write empirical theses that 10 years ago could only have been done by professors with mainframes and huge resources," Geraci says.

Among Geraci's allies in his struggle to hunt down and subdue huge, untamed data sets was a network of librarians with expertise in the rapidly changing field of social science data. Even as Geraci was writing his thesis and during the months since his graduation, the library has been reorganizing itself to better meet the needs of students who are working with increasingly massive and complex bodies of data.

"We've been thinking about this for a number of years, how to use our resources most efficiently," says Diane Garner, librarian for the social sciences of Harvard College Libraries, who oversees the growth and development of these new research aids.

"In our every-tub-on-its-own-bottom environment, libraries have been accustomed to acquiring their own copies of the resources needed by their clientele. With electronic databases this doesn't make sense. Many of them are quite expensive, and they can be shared and distributed across networks to wherever the users are, with the result that duplication is not only a waste of money, it is also less convenient for the users. But many electronic databases present challenges to users. Unlike books, you can't just take them off the shelf and read them. We are creating within the social sciences program a core of staff to whom users can turn when they need help. We are trying to get the best of both worlds--centralization and autonomy," Garner says.

In January 1998, four units within the Harvard College Library were linked together to form the Social Science Program, which has the aim of enhancing and coordinating social science collections and services. These units are Littauer Library, the Environmental Resources Service, the Harvard Map Collection, and Government Documents and Microforms.

The cornerstone of this new program is the Social Science Data Service, which helps researchers like Geraci navigate within these resources. The Data Service also collaborates closely with the Harvard-MIT Data Center (HMDC) to make more data more easily available across the University.

In July, Heather McMullen came on board as the social science data librarian. She joins librarians John Collins and John Baldisserotto as part of the team of specialists who guide students and other researchers through the vast universe of electronic data available at Harvard and through affiliated sources.

The sheer quantity of the material is deceptive, due to the form in which it is stored. True, much of the data is still paper-based, while another large division resides on microfilm, but the proportions are rapidly changing, and it is clear to those who order, catalog, and work with this material that the future belongs to digital formats, the most prevalent being, at present, the CD-ROM.

Many of these materials are kept in the Government Documents room in Lamont Library. A cool, quiet, below-ground space lit by a row of high windows, its banks of file cabinets contain row upon row of CD-ROMs from the departments of Education, Energy, Defense, Treasury, Transportation, the Environmental Protection Agency, the U.S. Patent Office, the Census Bureau, and numerous others.

The U.S. Government is the largest gatherer of statistics in the world, and, because Harvard is an official depository of government data, much of it is available here and is open to all users. The library is free to use its discretion in choosing which data to accept on deposit, but with the scholarly demand for quantitative information being what it is, library officials rarely respond in the negative.

"Government agencies ask us if we want specific documents," Garner says. "But generally, if it's data, we want it."

Government Documents received about 1,000 CD-ROMs last year, in addition to large quantities of data on paper and microfiche.

But the U.S. government depository holdings are only a fraction of the library's and Harvard's social science data. The library also buys data from commercial providers, including international statistics on economics, health, trade, crime, education, politics, and other topics. Harvard users also have access through the HMDC's Website to a huge repository of social science data from the Inter-University Consortium for Political and Social Research and the Roper Center.

In some cases, leading-edge technologies have allowed the library to acquire astoundingly complete bodies of data. A new set of CD-ROMs from Britain contains extensive information from the British Public Record Office, one of the oldest institutions of its type in the world, which has data going back to the Middle Ages. The discs contain digitized photographs of original documents that allow users to zoom in to examine written records in microscopic detail as well as to search the documents by keyword. If it were in paper form, the data would fill a large room.

In other cases, the library's acquisitions are driven by the needs of enterprising researchers. Recently, a graduate student writing a dissertation on international economics needed census data from Zambia. Through connections in the Harvard Institute for International Development (HIID), he hand-delivered a Zip drive and a stack of Zip disks to the Zambian central statistical office, where workers downloaded the entire 1991 census onto disks. The student brought the disks back to Cambridge where the HMDC burned the data onto CD-ROMs.

"As far as I know, we're the only place in the world outside of Zambia to have the complete census in this form," Garner says.

Although CD-ROMs provide an incredibly compact and convenient means of storing data, the library is already pushing ahead toward even more accessible formats. One plan, on the verge of being implemented, is to set up a computer server that will contain the most frequently used data sets from the library's collection of CD-ROMs. Once these are mounted on the server, they will be instantly accessible from special library workstations. The Social Science Program is also working with the Harvard-MIT Data Center to mount data on the HMDC Website (http://data.fas.harvard.edu/), which has an easy-to-use interface and is available across the University.

In addition to its rapidly growing collection of CD-ROMs, the library also provides access to significant Websites via HOLLIS Plus. One of the most useful is the Central Intelligence Agency's World News Connection, which posts selected, translated articles from foreign news media.

Another unique service involves the library's map collection, which has been undergoing a parallel development toward digitization. These changes have allowed researchers to combine statistical data with mapping tools such as Geographic Information Systems (GIS) software and boundary files to produce original maps. In many cases, these visual representations have led to further analytic insights.

"We see a new role for libraries, not just as passive repositories, but as active purveyors of information. We're always suggesting options for what students can do," Garner says.

Pausing, she gazes around the cubicle-filled space in the basement of Lamont in which she and the other librarians work to provide access to statistical data.

"I'm sometimes in awe of what we have here," she says. "I want the world to know about it."


 


Copyright 1998 President and Fellows of Harvard College