HARMONIA: Improved Data Center Resource Management
ECE Assistant Professor Devesh Tiwari, in collaboration with Mississippi State University, is leading a $500K NSF grant to create “HARMONIA: New Methods for Colocating Multiple QoS-Sensitive Jobs.” The project will develop a family of unconventional resource strategies that build on the principles of Bayesian Optimization (BO), while introducing innovations to BO itself and demonstrating its usefulness for data center resource management.
Abstract Source: NSF
Data centers and high performance computing (HPC) systems are considered the backbone of modern computing, supporting services that range from Web search to email, from video streaming to file sharing, and from social media platforms to scientific computing. Contemporary data center job schedulers employ conservative resource-sharing strategies among applications co-running on the same physical server. These strategies are conservative to ensure that the tight latency requirements of latency-critical applications are met; however, this conservatism leads to substantial underutilization of expensive computing resources, which drives up both capital and operational expenses. HARMONIA proposes a family of unconventional resource strategies that build on the principles of Bayesian Optimization (BO), while introducing innovations to BO itself and demonstrating its usefulness for data center resource management. HARMONIA will capture the impact of resource allocation on application performance using BO-based learning models, then partition the shared resources and adjust hardware/software knobs accordingly to maximize both the performance of individual applications and overall system utilization. To achieve practicality and scalability, HARMONIA employs a pool of lightweight, approximately accurate online learning models instead of a single heavyweight, fully accurate model. The HARMONIA runtime framework places incoming applications and co-locates them with existing applications in a dynamic, efficient, and non-intrusive manner.
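To make the idea of BO-driven resource partitioning concrete, here is a minimal sketch of a generic Bayesian Optimization loop that splits cores between one latency-critical job and one batch job. It is not the HARMONIA method, which the abstract does not specify. Everything named here is an assumption for illustration: the 16-core server, the 5 ms QoS target, the synthetic `measured_score` function standing in for real measurements, the RBF kernel, and the upper-confidence-bound rule for choosing the next partition to try.

```python
import math

TOTAL_CORES = 16  # hypothetical server size


def measured_score(lc_cores):
    """Synthetic stand-in for co-running both jobs and measuring them.

    lc_cores go to the latency-critical job, the rest to the batch job.
    """
    batch_cores = TOTAL_CORES - lc_cores
    latency_ms = 40.0 / lc_cores           # toy latency model
    throughput = math.sqrt(batch_cores)    # toy throughput model
    penalty = max(0.0, latency_ms - 5.0)   # assumed QoS target: 5 ms
    return throughput - 2.0 * penalty


def rbf(a, b, length=3.0):
    """Squared-exponential kernel between two core counts."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Gaussian-process posterior mean and std at x_new (standardized targets)."""
    mean_y = sum(ys) / len(ys)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / len(ys)) or 1.0
    z = [(y - mean_y) / std_y for y in ys]
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(K, z)
    v = solve(K, k_star)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = max(0.0, 1.0 - sum(k * w for k, w in zip(k_star, v)))
    return mean_y + std_y * mean, std_y * math.sqrt(var)


def bayes_opt(budget=8, kappa=2.0):
    """Try `budget` partitions, each chosen by the upper confidence bound."""
    xs = [1, TOTAL_CORES - 1]              # start from the two extremes
    ys = [measured_score(x) for x in xs]
    while len(xs) < budget:
        candidates = [c for c in range(1, TOTAL_CORES) if c not in xs]

        def ucb(c):
            mean, std = gp_posterior(xs, ys, c)
            return mean + kappa * std

        nxt = max(candidates, key=ucb)
        xs.append(nxt)
        ys.append(measured_score(nxt))
    best = max(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs


if __name__ == "__main__":
    best_cores, best_score, tried = bayes_opt()
    print("partitions tried:", tried)
    print("best latency-critical core count:", best_cores, "score:", best_score)
```

The point of the sketch is the sampling pattern: each measurement is expensive on a live server, so the surrogate model decides which partition is worth trying next rather than sweeping all of them. A real system would replace `measured_score` with online measurements and would tune several knobs at once, not only core counts.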
Outcomes of this project will influence the operations of modern data centers, which serve our computational needs for a variety of workloads, including short-running latency-critical applications (e.g., machine learning inferences, web search queries, microservices) and long-running throughput-oriented workloads (e.g., scientific simulations). Improving the utilization of large-scale data centers and HPC systems will lead to greater cost savings and a lower carbon footprint. Planned educational and outreach activities for HARMONIA include enhancing graduate coursework and introducing a new monthly podcast on “concepts in computer systems” to better engage and prepare high school students. All developed tools, software artifacts, and measured datasets will be made available to the research community to further enhance the project outcomes and their impact.
This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.