For more than 15 years, the Fragile Families and Child Wellbeing study led by Princeton University and Columbia University has worked to improve the lives of thousands of disadvantaged children in America. Families participating in the program have enabled researchers to study many factors about the children, their parents, their schools, and their environments. They have influenced policymakers not only in the United States, but around the world.
This long-running study gave birth to a data analytics competition, the Fragile Families Challenge, which in 2017 asked participants: Given all the background data from birth to age nine, as well as some training data from year 15, how well can you predict six key outcomes in the year 15 test data?
Brian Goode, research scientist at the Virginia Tech Discovery Analytics Center in the National Capital Region, received one of two Innovation Awards from the Fragile Families Challenge for his submission, which looked at both data-driven and process-driven approaches to create predictive models for six outcomes of 4,242 participants. One predictive model was submitted for each outcome category: Grade Point Average, Grit, Material Hardship, Layoff, Eviction, and Job Training.
“My submission placed a focus on understanding the ‘process behind the data,’” said Goode. “I believe this helps to better understand assumptions we might make while working with data, such as when filling in missing values in the dataset.”
“Of the nearly 44 million data points in the feature set, 55 percent of these values were either null, missing, or otherwise marked as incomplete,” Goode said. “These data amount to a substantial information loss and discarding it can potentially skew the data if there is any systematic reason as to why the nulls appear in the rows that they do.”
To the degree possible, Goode said, he made use of the survey questionnaire to establish imputation rules based on the survey structure and familial proximity.
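As a rough illustration of what questionnaire-driven imputation can look like, the Python sketch below applies two rules to a toy dataset. It is not Goode's code: the column names (`m_employed`, `m_hours`, `f_hours`, `m_household_size`, `f_household_size`), the `not_applicable` code, and the rules themselves are hypothetical, chosen only to show one rule based on survey structure and one based on familial proximity.

```python
# Hypothetical survey rows; None marks a null or missing answer.
# "m_" and "f_" prefixes stand for mother- and father-reported answers.
rows = [
    {"m_employed": "no", "m_hours": None, "f_hours": 40,
     "m_household_size": 4, "f_household_size": 4},
    {"m_employed": "yes", "m_hours": None, "f_hours": None,
     "m_household_size": None, "f_household_size": 5},
    {"m_employed": "yes", "m_hours": 35, "f_hours": 20,
     "m_household_size": None, "f_household_size": None},
]

NOT_APPLICABLE = "not_applicable"  # assumed code for a legitimately skipped question


def missing_fraction(records):
    """Return the share of cells whose value is None."""
    total = sum(len(r) for r in records)
    missing = sum(1 for r in records for v in r.values() if v is None)
    return missing / total if total else 0.0


def impute_row(row):
    """Apply two illustrative rules and return a new row."""
    out = dict(row)
    # Survey-structure rule: a respondent who is not employed skips the
    # hours question, so the blank is "not applicable", not "unknown".
    if out["m_employed"] == "no" and out["m_hours"] is None:
        out["m_hours"] = NOT_APPLICABLE
    # Familial-proximity rule: if the mother's report of a shared household
    # fact is missing, borrow the father's report of the same fact.
    if out["m_household_size"] is None and out["f_household_size"] is not None:
        out["m_household_size"] = out["f_household_size"]
    return out


imputed = [impute_row(r) for r in rows]
before = missing_fraction(rows)
after = missing_fraction(imputed)
print(f"missing before: {before:.0%}, after: {after:.0%}")
```

Values that no rule can explain stay missing, so the sketch never guesses where the questionnaire gives no basis for a fill.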
Goode acknowledges Dichelle Dyson, a Discovery Analytics Center summer intern from Friendship Tech Prep Academy, Washington, D.C., and Samantha Dorn, project manager, Roux Associates, Arlington, Virginia, for their assistance in validating code for matching questions in online surveys.
Goode was invited to describe his full process in a blog entry on the Fragile Families Challenge website.
In addition to the Innovation Award, Goode’s challenge submission was ranked fifth and ninth, respectively, in the Material Hardship and Layoff categories.
“It was exciting to be a part of the Fragile Families Challenge and helping to bring more emphasis to reproducible social science research. However, this is just the start, and we have much work to do to interpret these models and connect research to policy,” Goode said.
Goode presented his work at the Fragile Families Challenge Scientific Workshop, Nov. 16-17, at Princeton University. He has also coauthored a related paper, currently under journal review, with Debanjan Datta, a Ph.D. student at the Discovery Analytics Center, and Naren Ramakrishnan, the Thomas L. Phillips Professor of Engineering in the Department of Computer Science and director of the Discovery Analytics Center.
“Real data is often messy, incomplete, or otherwise unreliable and Brian’s solution pays careful attention to how gaps come about in the data and how to be systematic in handling them,” Ramakrishnan said.
The Fragile Families Challenge is based on the Fragile Families and Child Wellbeing study, a joint effort by Princeton University’s Center for Research on Child Wellbeing and Center for Health and Wellbeing, and the Columbia Population Research Center and the National Center for Children and Families at Columbia University.
The study includes information on attitudes, relationships, parenting behavior, demographic characteristics, mental and physical health, economic and employment status, neighborhood characteristics, and program participation. In-home interviews also provide data on children’s cognitive and emotional development, health, and home environment.
“The Discovery Analytics Center faculty and students seek to translate our algorithmic research into best practices and are enthusiastic contributors to analytics competitions such as the Fragile Families Challenge,” said Ramakrishnan.
Written by Barbara L. Micale