Virginia Tech has a history of breaking boundaries when it comes to supercomputing.

In 2003, System X debuted as the world’s largest supercomputer housed at a university. A decade later, HokieSpeed brought new capabilities for visualization and machine learning, while the BlueRidge cluster provided unprecedented capacity for computational research at the university.

Where size and power are concerned, however, all of these pale in comparison to TinkerCliffs, Virginia Tech’s new flagship high-performance computing (HPC) system, which is now available to the university community through Advanced Research Computing (ARC), a unit of the Division of Information Technology. With this addition, ARC now offers five supercomputing clusters to meet the broad research needs of the Virginia Tech community.

“Access to advanced high-performance computing is critical to the success of a growing number of Virginia Tech’s researchers and research institutes and centers,” explained Scott Midkiff, vice president for information technology and chief information officer. “TinkerCliffs and other ARC systems provide that access needed to remain competitive in computation-based research.”

Terry Herdman, associate vice president for research computing, added, “Since the installation of BlueRidge in 2013, the demand for HPC resources at Virginia Tech has consistently increased. Our users asked for a powerful central processing unit (CPU) cluster to help them complete larger and more complex operations.

“TinkerCliffs is a true 'workhorse' cluster that will provide the computing capacity for our users to advance their research, which in turn will strengthen their ability to secure external funding in a competitive research environment.”

Thanks to the latest high-density technology, TinkerCliffs provides nearly 10 times the computing power of its predecessor

TinkerCliffs is a 332-node cluster that contains an impressive 41,984 computing cores and over 93 terabytes (TB) of memory. In comparison, BlueRidge, which TinkerCliffs replaced, contained 408 nodes, with 6,528 cores and 27.3 TB of memory, while Cascades, now the second largest cluster after TinkerCliffs, contains 236 nodes with 7,288 cores and 40 TB of memory.
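The figures quoted above can be compared with a little arithmetic. The following Python sketch is illustrative only: the `CLUSTERS` dictionary and the `ratio` helper are names assumed for this example, and the TinkerCliffs memory figure is entered as 93 TB, a lower bound, because the article gives it as "over 93 terabytes."

```python
# Cluster figures as quoted in the article (memory in terabytes).
# TinkerCliffs memory is "over 93 TB"; 93.0 is used here as a lower bound.
CLUSTERS = {
    "TinkerCliffs": {"nodes": 332, "cores": 41_984, "memory_tb": 93.0},
    "BlueRidge": {"nodes": 408, "cores": 6_528, "memory_tb": 27.3},
    "Cascades": {"nodes": 236, "cores": 7_288, "memory_tb": 40.0},
}


def ratio(metric: str, new: str = "TinkerCliffs", old: str = "BlueRidge") -> float:
    """Return how many times larger `new` is than `old` for one metric."""
    return CLUSTERS[new][metric] / CLUSTERS[old][metric]


if __name__ == "__main__":
    for metric in ("nodes", "cores", "memory_tb"):
        print(
            f"{metric}: {ratio(metric):.2f}x BlueRidge, "
            f"{ratio(metric, old='Cascades'):.2f}x Cascades"
        )
```

On these numbers, TinkerCliffs has fewer nodes than BlueRidge but roughly 6.4 times the cores and at least 3.4 times the memory. Core count alone does not capture per-core performance, so these ratios are not a measure of overall computing power.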

A key factor that sets TinkerCliffs apart from previous supercomputers at Virginia Tech is its considerably higher density, which provides capabilities unavailable just a few years ago. TinkerCliffs contains 128 cores per node, whereas BlueRidge contained 16 cores per node. This density enables researchers to employ multiple modalities to tackle very large and complex calculations, simulations, and models, and many of them at once, far more quickly and efficiently than with previous clusters.

ARC computational scientist Matthew Brown elaborated on the benefits of this new, higher-density technology: “Even with specialized super-low latency, high-speed interconnecting networks, the overhead required to synchronize calculations across multiple nodes can be a serious restriction when scaling HPC workloads to a large number of cores. By packing more cores into a single node, it takes fewer nodes to reach a particular number of cores. Computations will be able to complete faster, or even to run at scales which were previously out of reach. By providing a massive number of cores, as well as a high core density, TinkerCliffs will be able to handle far more jobs simultaneously than previous clusters could.”
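Brown's point that higher density means fewer nodes for a given core count can be shown with a short calculation. This is a minimal sketch, not ARC's scheduling logic: the `nodes_needed` function and the 1,024-core job size are hypothetical, chosen only for illustration, while the per-node core counts are the ones quoted in the article.

```python
import math

# Cores per node as quoted in the article.
CORES_PER_NODE = {"TinkerCliffs": 128, "BlueRidge": 16}


def nodes_needed(total_cores: int, cores_per_node: int) -> int:
    """Smallest number of whole nodes that supplies `total_cores` cores."""
    if total_cores <= 0 or cores_per_node <= 0:
        raise ValueError("core counts must be positive")
    return math.ceil(total_cores / cores_per_node)


if __name__ == "__main__":
    job_cores = 1024  # hypothetical job size, for illustration only
    for cluster, per_node in CORES_PER_NODE.items():
        print(f"{cluster}: {nodes_needed(job_cores, per_node)} nodes")
```

For this hypothetical job, the same core count spans 8 nodes at 128 cores per node and 64 nodes at 16 cores per node, so far fewer nodes need to synchronize over the interconnect.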

Anticipating and building the infrastructure to meet Virginia Tech’s computational research needs now and into the future

Like any technology, HPC hardware is constantly evolving, allowing for more sophisticated computational research methods. To ensure that Virginia Tech researchers have access to the most capable HPC infrastructure to meet increasingly complex and demanding research needs, ARC upgrades its clusters on a regular basis.

When planning for the new cluster, ARC asked its users to share their current and future HPC needs. Computational and data scientist Robert Settlage, who helped to coordinate the selection and purchase of TinkerCliffs, said, “After a couple of rounds of graphics processing unit (GPU) centric purchases (e.g., P100 and V100 nodes), our users indicated a need for a CPU system for codes that do not translate well to GPU compute models. The Advanced Micro Devices (AMD) processors quickly became a front runner when considering both performance and physical limitations such as power and cooling. This system represents a huge leap forward in performance and capability that will augment VT’s research needs.” 

Computational scientist Justin Krometis added, “You can’t really find anything newer than what we have in TinkerCliffs.”  

By December 2019, the hardware, and the name, for TinkerCliffs had been chosen, and BlueRidge was decommissioned. Why TinkerCliffs? According to Matthew Brown, “Starting with BlueRidge, we got in the habit of naming our clusters after geographical landmarks in the region. TinkerCliffs just seemed appropriate for a cluster of this scale.”

Installing Virginia Tech’s largest supercomputer during a global pandemic

Originally, the Division of IT planned to install TinkerCliffs in the Andrews Information Systems Building, where ARC’s other four clusters (Huckleberry, Cascades, DragonsTooth, and NewRiver) are located. However, the recent merger between ARC and the research computing team from the Fralin Life Sciences Institute provided the opportunity to house TinkerCliffs in Steger Hall.

Kevin Shinpaugh, director of HPC operations for ARC, said that “the Steger Hall data center facility is ideal for a cluster of this magnitude.” The building can handle a load of more than 900 kilowatts at any given time (at full load, TinkerCliffs can draw 200 kilowatts), and the data center's cooling system combines a network of pipes carrying fluid chilled to 40 degrees Fahrenheit with a roof that provides free cooling when outdoor temperatures are below 50 degrees Fahrenheit.

The cluster was originally slated to be installed in April 2020 and made available for use by early summer. However, even supercomputers are not immune to the manufacturing and supply chain issues experienced by almost every industry in the wake of the COVID-19 pandemic.

“Parts for the cluster were delayed due to the pandemic,” Shinpaugh said. “While the original plan was to build the cluster and install software at the factory, eventually we made the decision to have the system built and shipped to Virginia Tech, so that we could meet our delivery deadline. This meant we had to finish the process of installing and testing the software on site.”

Systems engineer Bill Marmagas said, “Typically, we would give the hardware manufacturer our software specifications to complete installation and testing before the system came to us. Instead, we worked with them to install and test the software remotely.”

The expertise, resilience, and adaptability of the ARC systems engineering team are reflected in the fact that, despite these delays and minimal staff on site (with Virginia Tech on essential operations status at the time), TinkerCliffs was in operation by August. Initially, ARC opened TinkerCliffs to a handful of faculty researchers who had a long-standing relationship with the computational science team.

Brown explained, “While TinkerCliffs had been thoroughly tested by the systems engineering team, there are a few architectural differences that may affect how some users need to design their software to run on the new cluster. We wanted to work with a limited number of researchers at first, and use their feedback to fine-tune our processes and the system’s configuration, so we could provide the best service when we fully opened TinkerCliffs to the Virginia Tech community.”

Ying Zhou, associate professor of geophysics for the College of Science, was one of the first to perform computational research using TinkerCliffs. She has already noticed a difference in how the cluster can handle the complex operations that her research demands. “This new cluster will definitely accelerate discoveries in the Earth sciences,” Zhou said. “For example, we are now able to simulate hundreds of earthquakes in a couple of days; those waveform data will provide critical constraints for imaging the interior of Earth at a global scale, including the planet’s mantle and core.”

ARC is now welcoming all interested current and prospective users to learn more about the cluster’s capabilities for their research. For more information about TinkerCliffs and other HPC resources at Virginia Tech, please visit www.arc.vt.edu. To begin using ARC’s HPC resources, please request a consultation through the 4Help Service Catalog.