Keeping Supercomputers Running Effectively

August 18, 2021

ECE Assistant Professor Devesh Tiwari has developed methods to find and predict hardware failures and optimize the data storage in the world’s fastest supercomputers.

He troubleshoots the world’s fastest supercomputers, where system failure can cost millions

Main photo by Matthew Modoono/Northeastern University

From battling the coronavirus to modeling the forces responsible for the creation of galaxies, supercomputers are helping to solve some of the most pressing problems in the world today.

But these mammoth high-performance computing systems, some of which need a football field’s worth of floor space to house and tens if not hundreds of miles of cabling to operate, are prone to numerous kinds of system failures, glitches, and bugs. These problems, which are notoriously hard to predict, can be costly in both money and lost productivity, says Devesh Tiwari, an assistant professor of electrical and computer engineering at Northeastern.

Tiwari has been working on how to best identify these large-system vulnerabilities and recently earned a Rising Star in Dependability Award at the 51st annual International Conference on Dependable Systems and Networks for his work on improving the reliability and cost-effectiveness of supercomputers.

Drawing on his experience as a staff scientist at Oak Ridge National Laboratory in Tennessee, home to Summit, the world’s second most powerful supercomputer and the most powerful in the U.S., Tiwari developed methods for rooting out hardware failures, predicting future ones, and optimizing data storage.

Devesh Tiwari, Assistant Professor, Electrical and Computer Engineering. Photo by Matthew Modoono/Northeastern University

Ever-larger computer systems that perform increasingly complex tasks, and that rely on enormous amounts of power to operate, need to be reliable, Tiwari says. In computing parlance, reliability is a measure of, among other things, how well a system can withstand threats and be repaired after a hardware failure.

Improving reliability while reducing costs has been a “famous problem” over the last couple of decades, and one that many in the field are working to solve with federal funding, Tiwari says. These improvements have implications across a range of sectors that rely on these sophisticated computer systems, from weather modeling and medical research to national security and military operations.

“You have all of these really large supercomputers that are trying to solve really important problems,” Tiwari says. “This is why reliability is so important.”

The “rising star” award, given to a researcher within a decade of starting their field work, recognizes Tiwari for work that he says is largely theoretical, but that has been successfully applied to real-world supercomputers—bridging a longstanding divide between theory and practice that he says has stymied collaboration between academic researchers and systems administrators for years.

“Theory work is generally not welcomed by practitioners,” Tiwari says. “What I did was show that my work impacted real systems—even though it was theoretical.”

Most advanced nations are competing to build the fastest supercomputer, Tiwari says. That progress is tracked by TOP500, a website that ranks the world’s best-performing supercomputers by processing speed, measured in “petaflops,” or quadrillions of floating-point operations per second.

Currently, Japan’s Fugaku is the most powerful supercomputer in the world. The U.S. takes the second and third spots, followed by China in fourth. But China could be the first nation to operationalize a so-called “exascale” supercomputer, Tiwari says, which would in theory be faster than Fugaku.

Even today’s so-called “petascale” supercomputers, the fastest systems that presently exist, need up to 20 megawatts of power, Tiwari says. A single computer failure within such a network can create a drag on the whole system, potentially costing millions of dollars in power consumption within days.

These failures are of the utmost importance to catch and prevent from recurring, given the costs involved, Tiwari says.

“That’s a lot of money that you could have invested somewhere else,” Tiwari says. “Such energy can power a whole village in a developing country, like in India or Taiwan.”

by Tanner Stening, News @ Northeastern

Related Faculty: Devesh Tiwari

Related Departments: Electrical & Computer Engineering