Every moment data is created.
When a member of the Flint Water Study team tests and records results from a drop of water. When a student steps into Goodwin Hall, activating sensors to track usability and traffic patterns.
But data, especially big data that has to be analyzed computationally, sometimes creates as many questions as it answers. Where does it all go? How do we store it? Who pays to store it? What kind of computer do we need to process the data? And how can we make sure that people years from now will still be able to access and reuse it?
University Libraries, in partnership with Virginia Tech researchers working with big data, is exploring these questions and more with the support of a $308,175 National Leadership Grant for Libraries from the Institute of Museum and Library Services.
The project team includes Zhiwu Xie, technology development librarian in the University Libraries; Tyler Walters, dean and professor, University Libraries; Edward Fox, professor of computer science in the College of Engineering; and Pablo Tarazaga, assistant professor of mechanical engineering in the College of Engineering. Jiangping Chen, associate professor in the Department of Library and Information Sciences at the University of North Texas, will also help evaluate and review the project.
Libraries have recently supported research and data by hosting data sets, providing repositories for research, helping researchers manage their data, and even building custom infrastructures for storing and reusing big data.
“But libraries are starting to go beyond their capacity,” Xie said. “The big data projects we’re seeing at Virginia Tech and other institutions can hardly be handled using local infrastructures.”
Researchers need libraries to support data projects that require considerable processing power and quicker transfer rates when moving data from storage to processors.
“Much of the research landscape today is computational, and this is an awesome challenge for universities, government agencies, and other types of research institutes,” said Walters. “Researchers need partners like libraries to co-create new strategies and cyberinfrastructures, to advance their research, and sustain its products and findings.”
Walters leads SHARE, a project that notifies researchers when other research reaches key stages, such as when an award is made, initial data sets become available, or preprints are shared in a repository. As part of the Digital Library Research Lab, Fox leads the Event Digital Library and Archive, which hosts many terabytes of web and social media content related to large-scale national and international events. Tarazaga runs the Virginia Tech Smart Infrastructure Laboratory, which uses sensors and accelerometers to track movement in Goodwin Hall to better design buildings, save energy, and respond to emergencies.
These projects all use data infrastructures that University Libraries developed or collaborated on, and each one represents one of the three library service models for big data sharing and reuse. The models differ based on whether the data is stored centrally or in distributed storage and how the data is transferred and processed.
“As data becomes pervasive, knowing how to bring key players like the library in can change and facilitate how we move forward with our research in the engineering world,” said Tarazaga.
Together, this team will test and evaluate the performance of each project’s infrastructures, using the results to develop recommended strategies for other libraries and institutions to follow. Xie describes one outcome of the research as a data-sharing-and-reuse decision tree, factoring in not just data types, storage options, and computing needs, but also nontechnical factors, such as financial support and the skills and knowledge of those involved.
“The IMLS grant will allow contrasting use of the cloud with local infrastructures, like ours that is tailored for integrating focused crawling from the web, tweet collection, collaboration with the Internet Archive, and advanced methods of machine learning, natural language processing, information retrieval, digital libraries, archiving, visualization, and human-computer interaction,” said Fox.
Fox oversees the DLRL Hadoop Cluster, which supports computer science courses and research projects that can utilize its 150 terabytes of storage and a 10 Gbps connection to the Virginia Tech Research Network.
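To put those figures in perspective, here is a rough back-of-the-envelope sketch of how long it would take to move the cluster's full 150 terabytes over a 10 Gbps link. The function name is hypothetical, and the calculation assumes decimal terabytes, a fully saturated link, and no protocol or disk overhead, so real transfers would take longer.

```python
def ideal_transfer_hours(size_terabytes: float, link_gbps: float) -> float:
    """Best-case hours to move `size_terabytes` over a `link_gbps` link.

    Assumptions (not from the article): 1 TB = 10**12 bytes, the link runs
    at its full rated speed, and there is no protocol or storage overhead.
    """
    bits = size_terabytes * 1e12 * 8        # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)      # bits / (bits per second)
    return seconds / 3600                   # seconds -> hours


if __name__ == "__main__":
    # Figures quoted for the DLRL Hadoop Cluster: 150 TB of storage, 10 Gbps link.
    print(f"{ideal_transfer_hours(150, 10):.1f} hours")
```

Even under these ideal conditions, the transfer takes more than a day, which is one reason the time it takes to move data between storage and processors matters when comparing local and cloud infrastructures.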
Storing such large amounts of data costs money, said Xie. University researchers often write high-performance computing grants for fixed periods to cover the costs of cloud or central storage.
“What happens after the project period?” asked Xie. “We cannot always depend on grants to support data reuse in the long run, so we need to understand our options and test different scenarios.”
Following this research, Virginia Tech’s libraries will be better prepared to support data-intensive and big data projects. The recommended strategies developed as an outcome of this grant will also support researchers, libraries, and institutions across the world.
“Sharing, use, and reuse of data in a holistic manner across faculty, colleges, government, and industry is crucial for the community at large to be able to make sense of what is being gathered and to make good use of it as well,” added Tarazaga. “The work produced here will be easily transferable to other groups experiencing the same challenges.”
The Institute of Museum and Library Services is the primary source of federal support for the nation’s 123,000 libraries and 35,000 museums. Its mission is to inspire libraries and museums to advance innovation, lifelong learning, and cultural and civic engagement. Its grant making, policy development, and research help libraries and museums deliver valuable services that make it possible for communities and individuals to thrive. To learn more, visit www.imls.gov and follow IMLS on Facebook and Twitter.