Big Data Alternative Clustering
ECE Professors Jennifer Dy & David Kaeli, and CEE Associate Professor April Gu were awarded an $860K NSF grant for “Exploring Analysis of Environment and Health Through Multiple Alternative Clustering”.
Abstract Source: NSF
While many disciplines have become increasingly exploratory now that large-scale, multi-source data collection is prevalent, we find that volumes of data that were carefully collected and studied to answer project-specific questions are subsequently neglected, even if they hold key answers to tomorrow’s questions. This project addresses this challenge by developing novel data analysis, alternative clustering, data visualization and acceleration solutions to enable exploration and identification of connections hidden in diverse data sets, leading to new discoveries and knowledge. In particular, the data analysis algorithms will be applied to a large dataset taken from an ongoing National Institute of Environmental Health Sciences (NIEHS) project that is assessing the impact of water-borne pollutants on premature birth rates in Puerto Rico. The exploratory analysis will focus on discovering the unknown underlying environmental factors and processes that may more broadly impact health and the environment. As such, this study promotes progress in data science, environmental science and health. Note that this project will address women’s health in an under-served population. In addition, this project supports education through graduate research support, development of interdisciplinary tutorials, and creation of a new undergraduate class that addresses the intersection between machine learning approaches and parallel computing.
The environmental health data comprises multiple heterogeneous sources with varying temporal and spatial resolutions: mass spectrometer readings to identify targeted and non-targeted compounds in well and tap water, participant surveys detailing personal care and household product use in the home, and analyzed placental, blood and urine samples. Such complex data challenges traditional clustering algorithms in three ways. The first challenge is defining the appropriate similarity measure for each type of data source. The second challenge is how to integrate information from these multiple sources for clustering. The third challenge is that in exploratory analysis, the solution found may not be what the analyst is looking for. How can one discover alternative solutions given this knowledge? In real-world applications, data can often be interpreted in many different ways. However, existing multi-source fusion methods can only find a single solution. This study will develop new alternative clustering approaches that explore multiple heterogeneous information sources. The project will deliver both visual and scalable solutions to enable sifting through the mountains of data efficiently. Moreover, the project will produce parallel libraries within Spark, and demonstrate the power of these new methods on this environmental health application.
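The three challenges described in the abstract can be illustrated with a small toy sketch in Python. This is not the project's method: the data values, the choice of a Gaussian similarity for numeric readings and a Jaccard similarity for survey answers, the fusion weights, the farthest-pair clustering routine, and the penalty used to steer toward an alternative solution are all assumptions made up for illustration.

```python
import math

def gaussian_similarity(x, y, scale=1.0):
    """Similarity for numeric readings (assumed choice): exp(-squared distance / scale)."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / scale)

def jaccard_similarity(s, t):
    """Similarity for set-valued survey answers (assumed choice): |s & t| / |s | t|."""
    union = s | t
    return len(s & t) / len(union) if union else 1.0

def similarity_matrix(items, sim):
    n = len(items)
    return [[sim(items[i], items[j]) for j in range(n)] for i in range(n)]

def fuse(matrices, weights):
    """Integrate sources with a weighted sum of their similarity matrices."""
    n = len(matrices[0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights))
             for j in range(n)] for i in range(n)]

def two_clusters(S):
    """Toy clustering: seed with the least similar pair, then assign every
    other point to the seed it is more similar to."""
    n = len(S)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    a, b = min(pairs, key=lambda p: S[p[0]][p[1]])
    return [0 if i == a or (i != b and S[i][a] >= S[i][b]) else 1
            for i in range(n)]

def penalize(S, labels, penalty=0.5):
    """Lower the similarity of pairs already grouped together, so that a
    second run is pushed toward a different grouping."""
    n = len(S)
    return [[S[i][j] - penalty if i != j and labels[i] == labels[j] else S[i][j]
             for j in range(n)] for i in range(n)]

# Hypothetical data for four participants (values invented for illustration).
water = [(0.1,), (0.2,), (0.9,), (1.0,)]             # e.g. a compound concentration
products = [{"soap", "lotion"}, {"bleach", "detergent"},
            {"soap", "lotion", "shampoo"}, {"bleach", "detergent", "polish"}]

S = fuse([similarity_matrix(water, gaussian_similarity),
          similarity_matrix(products, jaccard_similarity)],
         weights=[0.7, 0.3])

first = two_clusters(S)                        # grouping driven mostly by water readings
alternative = two_clusters(penalize(S, first)) # a different view of the same data

print("first:", first)
print("alternative:", alternative)
```

In this toy data the first grouping follows the water readings, while the penalized run recovers the grouping suggested by product use, which shows how the same participants can admit more than one reasonable clustering.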