Northeastern Researcher Wins Meta Award to Quarantine and Vaccinate Silent Data Corruptions

Most of us have experienced that terrifying moment when a computer program unexpectedly quits or our computer screen suddenly goes black, sending us into panic mode over possibly losing hours of work.

“You would say, ‘Oh, my computer crashed,’” says Devesh Tiwari, associate professor of electrical and computer engineering in Northeastern’s College of Engineering. “But it is actually a good thing that it crashed when an error happened.”

Although it sounds counterintuitive, seeing symptoms of an error is a much better outcome than being unaware that the program produced an incorrect output due to a silent data corruption, he says.

Tiwari is one of five U.S. and international scientists who recently received a 2022 Meta Research Award. His proposal is to develop a quarantine and vaccination framework to mitigate silent data corruptions in large-scale systems. The two other awardees from the U.S. are faculty from Stanford University and Carnegie Mellon University.

Meta, the parent company of Facebook, Instagram and WhatsApp, awards funding to academics who propose research in specific areas that are of interest to the technology conglomerate. In this case, Meta wants to make sure that its applications are running and delivering services to billions of users via a fleet of servers as needed, without undetected problems.

“Silent data corruption in a broader sense is a part of improving computer systems resilience to faults,” Tiwari says.

As systems grow larger, the applications running on them become more prone to errors, he says. The problem is called silent corruption because the system does not detect these data errors, which can result in incorrect application-level output.

The real-life implications can vary. Imagine a scientist running a simulation to discover a new galaxy, Tiwari says. If some data gets silently corrupted and the computer returns a negative result, the scientist won’t know that this result is incorrect. The scientist will likely conclude that there is no galaxy there.

Even worse, envision yourself in a self-driving car as a pedestrian approaches the crosswalk. The car’s computer takes a photo and asks the program what that object is.

What if the computer decides it is a bird or a mouse instead of a pedestrian? The car won’t stop and will continue driving.

On the hardware side, Tiwari says, it means that certain particles strike a computer chip, and a bit of information gets flipped from zero to one or from one to zero. If the error is not noticed right away, it can lead to bigger problems.
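To make the effect of a single flipped bit concrete, here is a minimal Python sketch. It is an illustration only, not part of Tiwari's framework; the helper name `flip_bit` is hypothetical. It inverts one bit in the 64-bit IEEE 754 encoding of a number and shows how much the value can change without any crash or warning.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted.

    `flip_bit` is a hypothetical helper for illustration; bit 0 is the
    least significant mantissa bit and bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the float as an unsigned 64-bit integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # XOR with a one-bit mask turns a 0 into a 1 or a 1 into a 0.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 0))   # 1.0000000000000002 (barely noticeable)
print(flip_bit(1.0, 62))  # inf (a top exponent bit)
print(flip_bit(1.0, 63))  # -1.0 (the sign bit)
```

The same physical event, one bit changing, can be harmless or can turn a result into something wildly wrong, depending on which bit is hit. In none of these cases does the program stop.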

Failures or errors can affect nearby computer nodes in a large-scale computing facility.

“If some failure happens at one particular location, we empirically observed the probability is pretty high that other nodes nearby will also experience a failure, possibly due to shared circuitry, thermal and environmental conditions,” Tiwari says.

Tiwari, who is also among the affiliate faculty at the Khoury College of Computer Sciences and the Global Resilience Institute, has been working on issues of reliability for a long time. In 2015, his collaborator Saurabh Gupta, who now works at AMD, and others at Oak Ridge National Lab were the first to experimentally observe this phenomenon of spatial locality in failures.

“Saurabh was a fantastic collaborator,” Tiwari says. “He came up with this remarkable idea of quarantining the neighboring nodes of the node where the failure occurs to avoid the impact of failures on other nodes. This idea of quarantining the computer nodes pre-dates the COVID pandemic when quarantining became popular among humans.”

They saw these ideas, and variants of them, picked up and tested at other labs, including in some studies from Meta.

In his proposal to Meta, Tiwari suggested further development of a quarantine and vaccination framework for silent data corruption mitigation in large systems. Working with Meta allows researchers like Tiwari to do large-scale experiments on hundreds of thousands of machines using the company’s data centers as well as collaborate with the company’s researchers.

As for vaccination, there is no real vaccine for the nodes, Tiwari says, because you don’t physically inject anything into a computer system that will cure it permanently. In this case a vaccine would be a lightweight health checkup of the system.

Tiwari wants to develop a framework that would monitor the health of the system and combine prevention and mitigation of errors.

The framework will autonomously and automatically detect silent corruption, put nodes at risk into quarantine for a specific period of time, fix the error and conduct testing on non-critical tasks.
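The workflow described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the framework Tiwari proposed: the `Node` class, the `health_check` and `quarantine` functions, and the fixed quarantine period are all hypothetical. It models the "vaccine" as a known-answer test and quarantines a failed node together with its neighbors.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """A hypothetical compute node with a list of physically nearby nodes."""
    name: str
    neighbors: List[str] = field(default_factory=list)
    quarantined_until: float = 0.0  # timestamp; 0.0 means never quarantined

    def in_service(self, now: float) -> bool:
        return now >= self.quarantined_until


def health_check(compute: Callable[[], int], expected: int) -> bool:
    """Lightweight checkup: run a task with a known answer and compare."""
    return compute() == expected


def quarantine(nodes: Dict[str, Node], failed: str,
               now: float, period: float) -> List[str]:
    """Take the failed node and its neighbors out of service for `period`."""
    affected = [failed, *nodes[failed].neighbors]
    for name in affected:
        node = nodes[name]
        # Never shorten a quarantine that is already in effect.
        node.quarantined_until = max(node.quarantined_until, now + period)
    return affected


# Usage: node "a" returns a wrong answer, so "a" and its neighbor "b"
# are quarantined, while the distant node "c" keeps running.
nodes = {
    "a": Node("a", neighbors=["b"]),
    "b": Node("b", neighbors=["a"]),
    "c": Node("c"),
}
if not health_check(lambda: 44, expected=sum(range(10))):  # corrupted result
    quarantine(nodes, "a", now=100.0, period=50.0)
```

In a real deployment the checkup, the notion of "nearby," and the quarantine length would all have to be tuned to the hardware. The sketch only shows how detection and quarantine fit together.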

“I think the techniques would be largely applicable to different places,” he says.

The Meta award is unrestricted, meaning that there are no conditions on how the $50,000 is spent. Tiwari says he hopes to work on this subject with his Ph.D. students for multiple years.

by Alena Kuzub, News @ Northeastern

Related Faculty: Devesh Tiwari

Related Departments: Electrical & Computer Engineering