HARVARD GAZETTE ARCHIVES
James Robins makes statistics tell the truth
Numbers in the service of health
By Elizabeth Gehrman
Special to the Harvard News Office
The white board that lines hundreds of feet of the curved hallway at the Institute for Quantitative Social Science (IQSS) is not always covered with equations - but lately, it usually is. And most of them are in the haphazard hand of James M. Robins, an IQSS faculty associate and a professor of epidemiology and biostatistics at the Harvard School of Public Health. "I'm not the most organized person in the world," says Robins, his chair rolling over a splash of papers that spill out of his briefcase and onto the floor of his office. "So the equations usually sit there for a while before I type them into my computer."
That seems a metaphor for the path he has taken in life - circuitous but ultimately inevitable. The path began when Robins, then a junior resident at an occupational-health clinic he and a friend had founded at the Yale-New Haven Medical Center, started learning about statistics while researching workers' compensation cases.
Having taken "more abstract stuff" as a Harvard undergrad, he says, "I didn't know what this stuff was." He took some statistics courses but was mostly self-taught, applying Bayesian statistics to epidemiological concepts and learning the foundations and principles of statistical inference along the way.
"Epidemiology is a very strange field," he says. "Almost every textbook was called 'Intro to ...' because no one understood what to do about data on real exposures that vary over time." For example, workers with the highest exposure to a particular harmful chemical should have more of a particular disease; but real-world "confounders" skew study results. For example, people who start to get sick are likely to leave work and get less exposure - but it's hard to determine who leaves because of illness and who leaves for other reasons.
"It's a very hard problem," Robins says. "And basically, I spent the next 20 years thinking about it. I figured out a statistical trick that can turn the observational data we have into data we would have seen if we had done the study randomly." This creates a new data set in which some patients are copied more than once, depending on their probability of getting the treatment they actually did get based on doctors' patterns. This creates a "pseudo-population" that is essentially the same as a randomized cohort.
This is not the only statistical innovation Robins has come up with over the years, but, he says, "it's [the] easiest to explain, believe me."
The model has caught on in statistical circles as high as the FDA - and Robins has "gone off in a completely new direction" that is so complicated he presumes it will occupy him for the rest of his life.
"Statistics is sort of divided between nonparametric statistics on the one hand and parametric and semiparametric on the other," he says. "Nonparametric statistics make no a priori assumptions about the shape of a curve, and parametric and semiparametric statistics assume the curve can be described by a simple mathematical function such as a straight line or a parabola. The ways people think about and analyze data are very different, depending on whether they take a nonparametric or more parametric approach. I had a feeling there should be one unified story for everything."
Along with Amsterdam stochastics professor Aad van der Vaart, graduate students LingLing Li and Eric Tchetgen are helping. And they actually understand Robins as he stands before the white board saying, "It uses the twicing kernel with leave-one-out, right?"
The ultimate purpose of the unified theory, he explains, is to allow for more accurate estimates of uncertainty. "At some point I suddenly thought I knew which papers and ideas had the kernel," he continues, "what direction I had to go. Once I realized what that was, I realized what I would have to do is incredibly daunting. It has hundreds of layers. We go to the next layer and think we're done, and there's another layer inside it. It's endless."
Whether the theory will end up transforming modern statistics is still unclear. "No fancy statistical analysis is better than the quality of the data," Robins says. "Garbage in, garbage out, as they say. So whether the data is good enough to need this level of improvement, only time will tell."