Coordination:
Principal Investigator: Ricardo Vaca
Scope:
Funding:
Fundação para a Ciência e a Tecnologia – Advanced Computing Projects Call 6th Edition
The Milky Way is composed of multiple stellar populations, each bearing unique chemical and dynamical signatures of the Galaxy’s formation and interaction history. Identifying and classifying these populations within large-scale surveys, such as that conducted by the European Space Agency’s (ESA) Gaia space observatory, is essential for reconstructing the Milky Way’s evolution, yet remains computationally challenging due to the high dimensionality and complexity of the data. Recent advances in machine learning offer a promising approach to uncovering these structures, in particular manifold learning methods such as Uniform Manifold Approximation and Projection (UMAP) combined with density-based clustering algorithms such as Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN).
This project aims to develop and optimize a GPU-accelerated pipeline for the large-scale application of UMAP and HDBSCAN to synthetic Gaia-like datasets derived from the Feedback In Realistic Environments (FIRE-2) cosmological simulations. The goal is to systematically explore a wide range of hyperparameters and evaluate their effect on embedding quality, stability, and astrophysical interpretability. Deucalion’s GPUs, which are better suited than CPUs to these machine learning workloads, will make exhaustive hyperparameter sweeps feasible and allow us to provide a more comprehensive picture of the potential and limitations of this approach.
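As an illustration of what such a hyperparameter sweep could look like, the following is a minimal sketch and not the project's actual pipeline. It assumes the CPU packages `umap-learn` and `hdbscan`; a GPU run could swap in an accelerated library such as RAPIDS cuML, which provides UMAP and HDBSCAN estimators with a similar interface. The grid values and the helper names `iter_configs`, `embed_and_cluster`, `summarize`, and `run_sweep` are hypothetical, and the summary statistics (cluster count and noise fraction) are simple placeholders for the validated metrics the project intends to develop.

```python
from itertools import product

# Hypothetical grid: values are illustrative, not the project's settings.
GRID = {
    "n_neighbors": [15, 50, 200],     # UMAP: size of the local neighbourhood
    "min_dist": [0.0, 0.1, 0.5],      # UMAP: minimum spacing of embedded points
    "min_cluster_size": [50, 500],    # HDBSCAN: smallest group kept as a cluster
}


def iter_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def embed_and_cluster(X, n_neighbors, min_dist, min_cluster_size, seed=0):
    """Embed X in 2D with UMAP, then label the embedding with HDBSCAN."""
    # Imported lazily so the grid helpers work without these packages.
    import umap
    import hdbscan

    reducer = umap.UMAP(
        n_neighbors=n_neighbors,
        min_dist=min_dist,
        n_components=2,
        random_state=seed,
    )
    embedding = reducer.fit_transform(X)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    labels = clusterer.fit_predict(embedding)
    return embedding, labels


def summarize(labels):
    """Count clusters and the fraction of points HDBSCAN marks as noise (-1)."""
    labels = list(labels)
    n_clusters = len(set(labels) - {-1})
    noise = sum(1 for label in labels if label == -1)
    noise_fraction = noise / len(labels) if labels else 0.0
    return {"n_clusters": n_clusters, "noise_fraction": noise_fraction}


def run_sweep(X, grid, fit=embed_and_cluster):
    """Run every configuration in the grid and collect summary statistics."""
    results = []
    for config in iter_configs(grid):
        _, labels = fit(X, **config)
        results.append({**config, **summarize(labels)})
    return results
```

Calling `run_sweep(X, GRID)` on a feature array `X` (for example positions, velocities, and abundances per star) would return one record per configuration. Passing the fitting step as the `fit` argument keeps the sweep logic independent of the backend, so a GPU implementation can be substituted without changing the loop.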
The resulting framework will deliver a robust, reproducible tool for stellar population differentiation, accompanied by validated metrics, visualization utilities, and publicly released code. Scientifically, it will reveal how manifold learning can extract meaningful Galactic structures from complex, high-dimensional datasets, paving the way for its application to forthcoming large-scale surveys such as the fourth Gaia data release (DR4, expected December 2026) and the deep, wide-field Legacy Survey of Space and Time (LSST) to be conducted by the Vera C. Rubin Observatory, formerly known as the Large Synoptic Survey Telescope.
Allocated HPC resources: 99,968 CPU core-hours, 4,000 GPU-hours, 100 GB disk
