Yeh Leads $1M NSF Award to Redesign World’s Largest Big Data Network

Professor Edmund Yeh, of the Department of Electrical and Computer Engineering at Northeastern University, was recently awarded a $1 million, two-year National Science Foundation (NSF) grant, entitled “CC* Integration: SANDIE: Software Defined Network-Assisted Named Data Network for Data Intensive Experiments.” Northeastern is the lead on this multi-university initiative, working with the California Institute of Technology and Colorado State University.

The project team will redesign the network of the Large Hadron Collider (LHC) high energy physics program, which connects CERN, the European Organization for Nuclear Research in Geneva, Switzerland (the birthplace of the World Wide Web), with more than 170 sites in the U.S. and around the world. The LHC program is the world’s largest data-intensive application, estimated to handle 1 exabyte (1 million terabytes) of information by 2018. Physicists around the world access it to make Nobel Prize-winning discoveries such as the Higgs boson.

The significant volume of data being distributed, processed, accessed and analyzed within complex data intensive workflows requires a next-generation distribution and delivery system that is coordinated with scientific enterprise operations. “Today it takes scientists hours or days to access and process data, and current networks cannot keep pace with exponential growth projections. Networks need to operate like data warehouses, where the design is based on naming data objects rather than network communication endpoints, and content needs to be cached in strategically placed nodes in the network core as well as at the edge sites,” explained Yeh.
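The idea Yeh describes, requesting data by name and serving it from the nearest cache that holds a copy, can be illustrated with a minimal sketch. This is not the SANDIE implementation: the `ContentStore` and `Node` classes, the least-recently-used eviction policy, and the data names are all hypothetical, chosen only to show how an edge cache and a core cache might cooperate.

```python
from collections import OrderedDict


class ContentStore:
    """Least-recently-used cache keyed by data name (illustrative assumption)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, name):
        if name not in self._items:
            return None
        self._items.move_to_end(name)  # mark as most recently used
        return self._items[name]

    def put(self, name, data):
        self._items[name] = data
        self._items.move_to_end(name)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used


class Node:
    """A caching node that answers requests by name, not by host address."""

    def __init__(self, label, capacity, upstream=None, origin=None):
        self.label = label
        self.store = ContentStore(capacity)
        self.upstream = upstream  # next node toward the data source
        self.origin = origin      # dict standing in for the source repository

    def request(self, name):
        """Return (data, label of whoever served it)."""
        cached = self.store.get(name)
        if cached is not None:
            return cached, self.label
        if self.upstream is not None:
            data, served_by = self.upstream.request(name)
        else:
            data, served_by = self.origin[name], "origin"
        self.store.put(name, data)  # cache on the return path
        return data, served_by


# Hypothetical data names and payloads.
origin = {"/lhc/run1/block/0001": b"events-1", "/lhc/run1/block/0002": b"events-2"}
core = Node("core", capacity=2, origin=origin)
edge = Node("edge", capacity=1, upstream=core)

print(edge.request("/lhc/run1/block/0001"))  # first request reaches the origin
print(edge.request("/lhc/run1/block/0001"))  # repeat is served by the edge cache
```

In this sketch the requester never names a server. A repeated request is answered by the closest node that holds a copy, which is the behavior the quoted comparison to data warehouses points at.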

Building upon Yeh’s previous research on Named Data Networking (NDN), an NSF Future Internet Program project that seeks to restructure the internet for data distribution, the research team will develop and deploy NDN-based servers supported by advanced Software Defined Network (SDN) services to meet the challenges facing data-intensive science. SANDIE will build on NDN protocols and services and integrate them with the SDN methods and subsystems already developed, and still under development, by the Caltech, Northeastern and Colorado State high energy physics, network, and computer science teams. NDN edge and core caches will be strategically deployed at universities and scientific institutions around the world to facilitate distribution of LHC data from CERN.

Specifically, the project team will:

  • Derive an NDN-based operational model for data analysis that would benefit the LHC as well as other major data-intensive scientific programs such as the Large Synoptic Survey Telescope (LSST) and the Square Kilometer Array (SKA) astrophysics survey
  • Optimize data distribution and access performance in a highly-distributed petabyte-scale storage and computing system
  • Optimize the design and performance of network core and edge nodes
  • Develop appropriate naming schemes and attributes for fast access and efficient communication in high energy physics and other data intensive science fields
  • Scale the NDN system in both the amount and types of data supported as well as geographical extent
  • Adapt NDN to take advantage of SDN-enabled infrastructures
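
As a rough illustration of why naming schemes matter for fast access, the sketch below shows hierarchical data names being matched against a forwarding table by longest prefix, a standard technique in name-based forwarding. The names, the table entries, and the `longest_prefix_match` helper are hypothetical and do not reflect the naming scheme the project will actually develop.

```python
def components(name):
    """Split a hierarchical name such as '/lhc/cms/run1' into its components."""
    return tuple(part for part in name.split("/") if part)


def longest_prefix_match(table, name):
    """Return the next hop registered for the longest prefix of `name`, or None."""
    parts = components(name)
    # Try the full name first, then progressively shorter prefixes.
    for length in range(len(parts), -1, -1):
        prefix = parts[:length]
        if prefix in table:
            return table[prefix]
    return None


# Hypothetical forwarding table: name prefix -> outgoing link.
table = {
    components("/lhc"): "link-to-core",
    components("/lhc/cms"): "link-to-cms-cache",
}

print(longest_prefix_match(table, "/lhc/cms/run1/block/0001"))
print(longest_prefix_match(table, "/lhc/atlas/run1/block/0001"))
```

Under these assumptions, a name that groups related data under a shared prefix lets a single table entry steer every request for that data toward a nearby cache, which is one reason the choice of naming scheme affects access speed.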

Scientists from all over the world use the LHC network, remotely accessing massive amounts of data. Implementing an NDN-based data distribution architecture has the potential to reduce content access time and to improve scientists’ ability to extract a wealth of knowledge, whether subtle patterns, small perturbations or rare events, thereby accelerating science. In the future, the team’s vision is to apply the technology to other large-scale scientific data applications, including climate science, genomics, and biomedical research.

Related Faculty: Edmund Yeh

Related Departments: Electrical & Computer Engineering