# Dimensionality Reduction as Evidence of a New Regime of Galaxies

###### Currently, researchers believe that there are only two regimes of galaxies, delineated by whether they produce stars or not. However, using a novel machine learning technique to estimate the temperatures of galaxies, we find evidence of a distinct third regime of galaxy evolution. To demonstrate this, we studied roughly 650,000 galaxies in the publicly available COSMOS dataset. Moreover, we identify a group of galaxies that belong to this third regime, providing clear visual evidence of our three-regime model of galaxy evolution.

Author: Patrick Rim
California Institute of Technology
Mentors: Charles L. Steinhardt, Adam Blank
Cosmic Dawn Center, Niels Bohr Institute, California Institute of Technology
Editor: Laura Lewis

## Abstract

Dimensionality reduction is an unsupervised machine learning technique used to discover and visualize patterns in a dataset by reducing its number of dimensions. We show that by applying dimensionality reduction to a dataset of galaxies, we can provide evidence for an updated model of galaxy evolution. The currently accepted model of galaxy evolution defines two regimes of galaxies: star-forming and quiescent. However, using a novel technique called IMF fitting to estimate the temperatures of galaxies, we find that the regime of star-forming galaxies can be subdivided into two distinct regimes. To demonstrate this, we studied roughly 650,000 galaxies in the COSMOS dataset, each of which has flux measurements in twenty-seven photometric bands. First, we preprocessed the data by removing galaxies with incomplete or erroneous measurements and normalized the photometric measurements to a single photometric band. We then applied the t-SNE dimensionality reduction algorithm to the preprocessed dataset, determining appropriate values for t-SNE’s parameters. Observing the resulting plots, we found that t-SNE clustered the galaxies together by regime and that the three clusters representing the three regimes were ordered sequentially on the map. Thus, our study provides evidence of three distinct regimes of galaxy evolution.

## 1. Introduction

The universe is filled with stars, which were all formed at some point in time. Most existing stars were formed in the early stages of the universe, with star formation rates peaking about ten billion years ago [5]. In comparison, galaxies today are not forming stars at nearly the same rate. In fact, star formation rates have dropped to about three percent of that peak [24]. Following such observations, the currently accepted model of galaxy evolution defines two regimes: star-forming and quiescent [12].

We have recently discovered a way to estimate the temperatures of galaxies using a technique called Initial Mass Function (IMF) fitting, described by Sneppen et al. [21]. Studying both the star formation rates and the IMF temperatures of galaxies, we find a small group of star-forming galaxies that have higher temperatures than other star-forming galaxies. Therefore, two distinct groups of star-forming galaxies are observed. This observation challenges our currently accepted model of galaxy evolution, which details just two regimes of galaxies: star-forming and quiescent. Thus, accounting for the newly-discovered group of star-forming galaxies, we propose an updated three-regime model of galaxy evolution.

In this paper, a machine learning technique called “dimensionality reduction” is used to provide experimental evidence of this new third regime of galaxies. In particular, we apply a dimensionality reduction algorithm called “t-SNE” [28], or “t-distributed stochastic neighbor embedding,” to the Cosmic Evolution Survey (COSMOS) dataset, which contains about one million galaxies and twenty-seven photometric bands for each galaxy [15]. Using dimensionality reduction, we are able to visualize this high-dimensional dataset and correct for numerous issues that arise when analyzing data points in a high-dimensional space. The most important of these issues is that data points are too far apart to be grouped into clusters [20]. We describe and justify the methods we use to preprocess the COSMOS dataset, such as feature scaling and removing outliers. By preprocessing the dataset, we are able to produce meaningful plots that correctly cluster together physically similar galaxies. We show that the maps produced by t-SNE when applied to COSMOS provide clear visual evidence of our three-regime model of galaxy evolution.

This paper is structured as follows: Section 2 details the currently accepted two-regime model of galaxy evolution and our proposed three-regime model of galaxy evolution. Section 3 describes the COSMOS dataset of galaxies. In Section 4, we introduce and motivate the use of dimensionality reduction, as well as the t-SNE algorithm. In Section 5, we describe how we preprocess COSMOS and apply t-SNE to it. In Section 6, we present the maps produced by t-SNE when applied to COSMOS. In Section 7, we summarize and discuss our results.

## 2. Models of Galaxy Evolution

### 2.1 Two-Regime Model

When we observe the galaxies in the Hubble Deep Field, a small region of space surveyed extensively by the Hubble Space Telescope, we see galaxies of various sizes, shapes, and ages [11]. However, in the last fifteen years, a puzzling fact about “star-forming galaxies” has been discovered: star formation rates seem to be consistent across galaxies with varying features and properties. These galaxies comprise the “Star-Forming Main Sequence” [22], in which all star-forming galaxies seem to form stars at the same rate. We also observe “quiescent galaxies,” which are galaxies that produce stars at a low rate. The currently accepted model of galaxy evolution classifies galaxies as either star-forming or quiescent [12]. There are two main reasons that astronomers accept this two-regime model.

First, multiple previous studies have shown that star-forming galaxies and quiescent galaxies can be clearly distinguished on a U-V vs. V-J (UVJ) diagram [14]. The U-V color band contains wavelengths from ultraviolet to visible light while the V-J color band contains wavelengths from visible to near-infrared light. Plotting galaxies of similar redshift and mass on a UVJ diagram that compares each galaxy’s fluxes in the U-V color band and the V-J color band, we see that star-forming galaxies and quiescent galaxies are clustered separately on either side of a distinct boundary (Figure 1).

Second, in today’s night sky, we observe red and blue galaxies, where quiescent galaxies appear red because they are dominated by red, low-mass stars, and star-forming galaxies appear blue because they are dominated by blue, high-mass stars [2]. We do not observe any galaxies that appear neither red nor blue. In this way, all star-forming galaxies appear to be identical—until we account for their IMF temperatures.

### 2.2 Three-Regime Model

Only the two regimes of “quiescent” and “star-forming” are apparent when studying the sSFRs, or specific star formation rates, of galaxies. This changes when we take into account the IMF temperatures of these galaxies, which are estimated with the IMF fitting method discovered by Sneppen et al. [21].

When we plot galaxies by log(sSFR) and IMF temperature, we see that galaxies with high values of sSFR, which are star-forming galaxies, span a wide range of temperatures. For example, in the hexbin plot of galaxies with a redshift value between 0.7 and 0.8, labeled a) in Figure 2, we see that there are galaxies that form a “horizontal branch” that have a different relationship between star formation rate and temperature than most of the other star-forming galaxies. While most star-forming galaxies have a similar temperature, the galaxies in this branch have a significantly higher temperature. This means that not all star-forming galaxies are physically similar, which contradicts the currently accepted theory.

Figure 2 shows six hexbin plots, each containing the galaxies in a redshift range of size 0.2, with the ranges spanning from 0.5 to 1.7. Galaxies are binned by redshift because galaxies in different redshift ranges will appear different even if they are physically similar. This is because the wavelengths at which their fluxes are measured are significantly redshifted. The hexbin plot containing galaxies with redshift values between 0.5 and 0.7 is labeled, and all other plots should be interpreted identically.

We see that there are two distinct regimes of star-forming galaxies, which are labeled as Region 1 and Region 3. We label the galaxies in Region 1 (the “horizontal branch”) as the Pre-Main Sequence galaxies, because we believe that these galaxies are hot star-forming galaxies that are cooling into the Star-Forming Main Sequence, which we have labeled as Region 3. We have allocated Region 2 as a transition phase between the Pre-Main Sequence galaxies and the Star-Forming Main Sequence galaxies. Region 5 contains the quiescent galaxies and Region 4, similarly to Region 2, is a transition phase between the Star-Forming Main Sequence galaxies and the quiescent galaxies.

## 3. COSMOS Dataset

While observing the spectra of galaxies is likely the most accurate method of studying them, observing a high-resolution spectrum of a galaxy can be expensive and time-consuming. Thus, many astronomers instead use photometry, which measures the total amount of light emitted by a galaxy in certain wide “bands” of wavelengths and is much quicker and cheaper [9]. It is generally accepted that photometry is an accurate method of studying galaxies because it is unlikely that galaxies emit light in completely arbitrary combinations of wavelengths [17]. Thus, if we observe the light that a galaxy emits in certain “bands”, we can model it as if it were a nearby galaxy. Previous studies of galaxies using photometry have been able to accurately deduce their properties, so we have evidence that photometry is a valid and efficient method of studying galaxies [17].

The COSMOS dataset of galaxies contains the “band” measurements required to conduct a photometric study. It catalogs about one million galaxies and twenty-seven photometric bands for each galaxy, making it one of the largest multi-wavelength catalogs of galaxies [15]. We see that COSMOS is a twenty-seven-dimensional dataset, where each photometric band is an “attribute” of the COSMOS dataset.

For our study, the flux measurements of the galaxies were retrieved from COSMOS2020, the most recent version of COSMOS published in 2021, while other physical properties of the galaxies were retrieved from COSMOS2015, the version of COSMOS published in 2016. The COSMOS2020 and COSMOS2015 datasets contain the same galaxies, each identified with a common unique identification number.
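Combining fluxes from one catalog with physical properties from another amounts to a join on the shared identification number. The following is a minimal sketch of such a cross-match using NumPy; the array names and all values are hypothetical stand-ins, not the actual COSMOS column names or data.

```python
import numpy as np

# Hypothetical IDs and fluxes standing in for the COSMOS2020 catalog.
ids_2020 = np.array([101, 102, 103, 104, 105])
flux_2020 = np.array([[1.0, 2.0], [0.5, 0.7], [3.0, 2.5], [1.2, 1.1], [0.9, 1.8]])

# Hypothetical IDs and redshifts standing in for the COSMOS2015 catalog.
ids_2015 = np.array([105, 103, 101, 999])
redshift_2015 = np.array([0.95, 0.62, 1.31, 0.40])

# intersect1d returns the shared IDs in sorted order, together with
# the position of each shared ID in the two input arrays.
common_ids, idx_2020, idx_2015 = np.intersect1d(
    ids_2020, ids_2015, assume_unique=True, return_indices=True
)

# Row i of both arrays now refers to the same galaxy, common_ids[i].
matched_flux = flux_2020[idx_2020]
matched_redshift = redshift_2015[idx_2015]
```

Galaxies that appear in only one of the two arrays (IDs 102, 104, and 999 in this toy input) are dropped by the match.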

## 4. Dimensionality Reduction

Dimensionality reduction, an unsupervised machine learning technique, is the process of reducing the number of dimensions, or “attributes,” of a dataset [7]. Rather than keeping a select number of dimensions and discarding the excess ones, dimensionality reduction algorithms create entirely new dimensions that preserve information from each of the original dimensions. Dimensionality reduction algorithms have successfully been used in previous studies to classify galaxies [25]. There are two main reasons that dimensionality reduction is a useful tool for analyzing high-dimensional datasets.

The first reason can best be summarized as the “curse of dimensionality.” In a high-dimensional space, data points are very sparse: most points are so far from one another that they all seem dissimilar [6]. Sparsity is measured using the Euclidean distance metric, which is the standard distance metric used to compute distances between points in a high-dimensional space. Because most data points are extremely sparse, clustering algorithms produce statistically insignificant results when applied to a high-dimensional dataset [1]. Furthermore, when distances are computed in a high-dimensional space, noise may be overemphasized since the noise may be present in many different dimensions [16]. High-dimensional spaces also have properties that are unintuitive to human analysts, since we are more familiar with two- and three-dimensional spaces [13]. By applying dimensionality reduction to a high-dimensional dataset, we are able to overcome these issues by working with the data points in just two dimensions.
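One way to see this effect numerically is to draw random points and compare the nearest and farthest Euclidean distances from a reference point. The sketch below is an illustration on uniformly random data, not part of our analysis; the function name `relative_contrast` and the sample sizes are assumptions chosen for the example.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for distances from one point to the rest."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return (distances.max() - distances.min()) / distances.min()

# Same number of points, increasing dimensionality.
contrast_2d = relative_contrast(500, 2)
contrast_27d = relative_contrast(500, 27)    # as many dimensions as COSMOS bands
contrast_784d = relative_contrast(500, 784)  # as many dimensions as MNIST pixels

print(contrast_2d, contrast_27d, contrast_784d)
```

As the number of dimensions grows, the contrast shrinks: the nearest and farthest neighbors end up at nearly the same distance, which is why distance-based clustering becomes less informative.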

Another reason why we use dimensionality reduction is that it is hard, if not impossible, to visualize a high-dimensional dataset. In a one-dimensional graph, data points are plotted along a single line, making it very easy to visualize and compare them. In a two-dimensional graph, which is a very commonly used data visualization tool, each data point is ordered from left to right in ascending order in one attribute, and bottom to top in ascending order in the other attribute. As such, two-dimensional graphs are also very easy to visualize. A three-dimensional graph is harder to visualize than a one-dimensional or two-dimensional graph, although it is still feasible by creating an illusion of depth. We may even try to use other visible attributes such as the colors or the sizes of the data points to represent values in four or five dimensions on a graph. However, there are only so many creative ways to represent additional dimensions on a two-dimensional surface. There is clearly no feasible way to visualize a dataset like COSMOS, which contains twenty-seven dimensions. Applying dimensionality reduction to these high-dimensional datasets allows us to visualize the data points on a standard two-dimensional graph.

### 4.1 t-SNE Algorithm

t-SNE, or t-distributed stochastic neighbor embedding, is one instance of a dimensionality reduction algorithm [28]. The t-SNE algorithm takes as inputs the high-dimensional dataset and the intended dimensionality of the output space, and it embeds the data points into a lower-dimensional space. In the process, t-SNE preserves the nearest neighbors of each data point but does not necessarily preserve exact distances and point densities. For most applications of t-SNE, a two-dimensional output space is selected for visualization purposes. The reduced dimensions created by t-SNE, or any other dimensionality reduction algorithm, are unitless and the numerical values of data points in these dimensions are insignificant.

t-SNE also has an input parameter called “perplexity” [8]. Perplexity can be thought of as the number of nearest neighbors that t-SNE factors in for each point when embedding the data points in a lower-dimensional output space. Choosing an optimal value of perplexity is accomplished through a heuristic approach. In particular, to choose a value for perplexity for a given iteration of t-SNE, a guess is made for the average number of relevant neighbors that each point in the dataset has. The value of perplexity can then be increased or decreased based on a qualitative observation of the map of the output space. Generally speaking, a lower value of perplexity means that t-SNE will preserve the local relationships between the data points, while a higher value of perplexity means that t-SNE will preserve the overall global structure of the data [27].
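In the original t-SNE paper [28], perplexity is defined as $2^{H}$, where $H$ is the Shannon entropy (in bits) of the probability distribution that a point assigns to its neighbors. The sketch below, with a helper name of our choosing, illustrates why perplexity can be read as an effective number of neighbors.

```python
import numpy as np

def perplexity(p):
    """Perplexity 2**H of a discrete distribution p, with H in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # terms with zero probability contribute nothing to H
    entropy_bits = -np.sum(p * np.log2(p))
    return 2.0 ** entropy_bits

# A point that weights 40 neighbors equally has perplexity 40.
uniform_over_40 = np.full(40, 1 / 40)

# A point that puts almost all weight on one neighbor has perplexity near 1.
peaked = np.array([0.97, 0.01, 0.01, 0.01])

print(perplexity(uniform_over_40), perplexity(peaked))
```

Setting the perplexity parameter therefore tells t-SNE roughly how many neighbors each point should treat as relevant.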

In order to justify why the t-SNE algorithm was used for this project, we first compare t-SNE to another commonly used dimensionality reduction algorithm: principal component analysis, or PCA. While PCA is a linear algorithm, t-SNE is a non-linear algorithm [10, 18]. PCA reduces a dataset to n dimensions by constructing a basis of n orthonormal vectors from the original attributes and expressing each data point as a linear combination of those n basis vectors. On the other hand, t-SNE arranges the data points in n dimensions by minimizing a cost function called KL divergence, which is computed over all pairs of points [26, 19].
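To make the linear nature of PCA concrete, the following is a minimal sketch of PCA computed through a singular value decomposition. The function name `pca_project` and the random toy data are assumptions for illustration; this is not the code used in our study.

```python
import numpy as np

def pca_project(data, n_components):
    """Project rows of data onto the first n_components principal directions."""
    centered = data - data.mean(axis=0)
    # Rows of vt are orthonormal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components]
    # Each output coordinate is a linear combination of the original attributes.
    return centered @ basis.T, basis

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 27))  # 200 fake points with 27 attributes
coords, basis = pca_project(toy, 2)

print(coords.shape)  # (200, 2)
```

Because every output coordinate is a fixed linear combination of the inputs, PCA cannot bend or unfold curved structure in the data, which is the limitation that non-linear methods such as t-SNE are designed to avoid.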

KL divergence, or Kullback-Leibler divergence, calculates the relative entropy between two probability distributions, where relative entropy is the measure of how a probability distribution differs from another probability distribution [26, 19]. For two continuous probability distributions P and Q defined over an interval [a, b], the relative entropy between P and Q, denoted by $D_{KL}(P\,||\,Q)$, is defined as:

$D_{KL}(P\,||\,Q) = \int_a^b p(x)\,\log{\left(\frac{p(x)}{q(x)}\right)}dx$

where p and q are the probability density functions of P and Q respectively.
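For discrete distributions, the integral becomes a sum, $D_{KL}(P\,||\,Q) = \sum_i p_i \log(p_i / q_i)$. The sketch below evaluates this sum for two small, made-up distributions to show two properties of the measure: it is zero when the distributions are identical, and it is not symmetric.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) using the natural logarithm."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i = 0 contribute zero to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two made-up distributions over three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, p))  # identical distributions
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
```

The sketch assumes that `q` is nonzero wherever `p` is nonzero; otherwise the divergence is infinite.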

### 4.2 Applying t-SNE to MNIST

We briefly study the application of t-SNE to the popular MNIST dataset to illustrate how t-SNE works. MNIST is a dataset of 42,000 images of digits from 0 to 9, where each image is 28 pixels across and 28 pixels down [3]. This means that each image contains $28^2 = 784$ total pixels. Each pixel can be considered an attribute, or “dimension”, of the dataset, which makes MNIST a 784-dimensional dataset. For the previously detailed reasons, we apply the t-SNE dimensionality reduction algorithm to cluster and map the images in a two-dimensional output space.

Non-linear dimensionality reduction algorithms like t-SNE can cluster the images in MNIST more effectively than linear dimensionality reduction algorithms such as PCA. This is because an image of a digit composed of a grid of pixels cannot be logically represented by PCA as a linear combination of basis vectors, unlike other objects such as words or numbers. Thus, PCA is unable to accurately cluster images of the same digit together, while t-SNE is able to do so.

As Figure 3 shows, for the MNIST dataset, focusing on the local structure of the clusters by choosing a low perplexity value yields the most accurate and well-defined clusters. This is an unexpected result because MNIST is quite a large dataset, and focusing on global structure tends to work more effectively for large datasets. However, we can reason that a low perplexity value worked well for MNIST because each image only has a few other images that look similar to it.

As seen in Figure 4, t-SNE has accurately clustered together images of the same digit in a two-dimensional space. The way in which t-SNE arranges the distinct clusters of digits is also significant. Consider the green cluster of images of the digit ‘3’ in the middle of the map. An image of a ‘3’ on the left side of the cluster, nearer to the orange cluster of images of the digit ‘1’, is more vertical—resembling a ‘1’. On the other hand, an image of a ‘3’ on the right side of the cluster, nearer to the purple cluster of images of the digit ‘8’, is more round and more closely resembles an ‘8’. Even across clusters, t-SNE places data points near other similar points, which is useful in other real-world datasets such as COSMOS that lack well-defined clusters.

## 5. Applying t-SNE to COSMOS

For the purposes of our study, we worked with a subset of the COSMOS dataset containing only the galaxies that have an IMF temperature assigned by the IMF fitting method. This was done because we need the IMF temperature of a galaxy to label it with one of the regimes of our three-regime model of galaxy evolution. This subset of the COSMOS dataset that we used in our study contains 97,217 distinct galaxies, which is about 9.72% of the entire COSMOS dataset.

Preprocessing a dataset before applying an algorithm such as t-SNE facilitates more accurate and efficient data analysis [23]. In our analysis of COSMOS, we applied two preprocessing techniques: removing erroneous data and feature scaling (normalization).

The COSMOS dataset contains erroneous data in the form of nonmeasurements and nondetections, where some galaxies in the dataset have erroneous or no measurements in some or all of the photometric bands. In COSMOS, the value -99 denotes a nonmeasurement or nondetection for a galaxy in a certain photometric band. By removing the roughly 8,000 galaxies in our subset with flux values of -99 in any photometric band, we were able to create a more accurate output map.
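The removal step can be sketched as a row filter on the flux table. The array below is a made-up example with three bands rather than twenty-seven; only the sentinel value -99 comes from the catalog convention described above.

```python
import numpy as np

SENTINEL = -99.0  # marks a nonmeasurement or nondetection

# Made-up fluxes: one row per galaxy, one column per photometric band.
fluxes = np.array([
    [1.2, 0.8, 2.1],
    [-99.0, 0.5, 1.0],   # missing in the first band
    [0.7, 0.9, 1.1],
    [2.0, -99.0, -99.0], # missing in two bands
])

# Keep a galaxy only if none of its bands holds the sentinel value.
valid = ~np.any(fluxes == SENTINEL, axis=1)
clean = fluxes[valid]

print(clean.shape)  # (2, 3)
```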

Normalizing the COSMOS dataset was a critically important step because the galaxies are at different distances, which means flux measurements alone cannot be used to compare the luminosities of galaxies at different distances. Consider this example: Galaxy X has flux measurements of 1 in Photometric Band A and 1 in Photometric Band B, Galaxy Y has flux measurements of 1 in Band A and 2 in Band B, and Galaxy Z has flux measurements of 4 in Band A and 4 in Band B. Suppose that Galaxy X and Galaxy Y are at the same distance, while Galaxy Z is at half the distance of the other two galaxies. According to the inverse square law, Galaxy X and Galaxy Z actually have the same luminosity in both bands. However, a dimensionality reduction algorithm would consider Galaxy X and Galaxy Y to be more similar than Galaxy X and Galaxy Z because the Euclidean distance between them in these two measurements is smaller (1 between Galaxy X and Galaxy Y; $3\sqrt{2}$ between Galaxy X and Galaxy Z). Thus, to fix this bias, we normalize all of the measurements to one band. That is, all of the flux measurements are scaled such that every galaxy has the same flux value in one of the bands, and the measurements in the other bands are scaled accordingly. In this example, Galaxy X would be scaled to have flux measurements of 1 in Band A and 1 in Band B, Galaxy Y would be scaled to have flux measurements of 1 in Band A and 2 in Band B, and Galaxy Z would be scaled to have flux measurements of 1 in Band A and 1 in Band B.
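The three-galaxy example can be checked numerically. The sketch below normalizes each galaxy to Band A and compares Euclidean distances before and after; the helper name `normalize_to_band` is our own label for illustration.

```python
import numpy as np

# Rows are Galaxy X, Y, Z; columns are Band A, Band B.
fluxes = np.array([
    [1.0, 1.0],  # X
    [1.0, 2.0],  # Y
    [4.0, 4.0],  # Z
])

def normalize_to_band(fluxes, band_index):
    """Divide every galaxy's fluxes by its own flux in the reference band."""
    return fluxes / fluxes[:, [band_index]]

normalized = normalize_to_band(fluxes, 0)

# Distances in the raw fluxes: Y looks closer to X than Z does.
dist_xy_raw = np.linalg.norm(fluxes[0] - fluxes[1])
dist_xz_raw = np.linalg.norm(fluxes[0] - fluxes[2])

# Distances after normalization: X and Z coincide.
dist_xy_norm = np.linalg.norm(normalized[0] - normalized[1])
dist_xz_norm = np.linalg.norm(normalized[0] - normalized[2])

print(dist_xy_raw, dist_xz_raw, dist_xy_norm, dist_xz_norm)
```

After normalization the reference band holds the same value for every galaxy, so it no longer contributes to any distance between galaxies.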

After applying t-SNE to the preprocessed COSMOS dataset, we found that the primary attribute that influenced the clustering of points was redshift. Two similar objects look completely different at different redshifts because the flux measurements at one set of wavelengths for one galaxy are redshifted to a new set of wavelengths for the other galaxy. Because of this, we divided the galaxies into redshift ranges and created a t-SNE plot of the galaxies in each range. The redshift ranges span from 0.5 to 1.5 and are of size 0.1, meaning that the first redshift range contains galaxies from redshift 0.5 to 0.6, the second redshift range contains galaxies from redshift 0.6 to 0.7, and so on. Within any one of these ranges, the galaxies have values of 1+z, which is 1 added to the redshift value, within 10% of each other.
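The binning can be sketched with NumPy as follows. The redshift values are made up for illustration; the bin edges follow the ranges described above. The last lines check the statement about 1+z by comparing the two edges of every range.

```python
import numpy as np

# Ten ranges of width 0.1 between redshift 0.5 and 1.5.
edges = np.linspace(0.5, 1.5, 11)

# Made-up redshifts; two of them fall outside the studied interval.
z = np.array([0.52, 0.67, 1.49, 0.30, 1.70])

# digitize returns 1 for the first range, so subtract 1 to get a 0-based index.
bin_index = np.digitize(z, edges) - 1
in_range = (bin_index >= 0) & (bin_index < len(edges) - 1)

print(bin_index[in_range])  # range index for each retained galaxy

# Largest possible ratio of (1 + z) between two galaxies in the same range.
ratios = (1 + edges[1:]) / (1 + edges[:-1])
print(ratios.max())  # below 1.10, reached in the lowest-redshift range
```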

## 6. Results of t-SNE on COSMOS

After experimenting with different values of perplexity, we used a perplexity value of 40 to generate our final two-dimensional t-SNE maps. This is a moderately high value of perplexity.
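For readers who wish to reproduce a map of this kind, the following is a minimal sketch using the `TSNE` class from scikit-learn. The choice of library, the random stand-in data, and every setting other than the perplexity of 40 and the two output dimensions are assumptions for illustration, not a record of our exact configuration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Random stand-in for the normalized fluxes of the galaxies in one
# redshift range: 300 fake galaxies with 27 photometric bands each.
features = rng.normal(size=(300, 27))

tsne = TSNE(
    n_components=2,   # two-dimensional output map
    perplexity=40,    # the value used for our final maps
    init="pca",       # assumption: a common, more stable initialization
    random_state=0,   # assumption: fixed seed for repeatable output
)
embedding = tsne.fit_transform(features)

print(embedding.shape)  # (300, 2)
```

The number of samples must exceed the perplexity, and the axes of the resulting map are unitless, so only the relative placement of points is meaningful.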

Figure 5 shows ten two-dimensional t-SNE maps, each containing the galaxies in COSMOS in the labeled redshift range of size 0.1. In each of these maps, it is apparent that there is a gradual transition from galaxies in Region 1, which are the Pre-Main Sequence galaxies, to galaxies in Region 3, which are the Star-Forming Main Sequence galaxies, and then to galaxies in Region 5, which are the quiescent galaxies. Galaxies in Regions 2 and 4, the transition phases, also appear between the regimes they connect, so that the galaxies are arranged roughly sequentially from Region 1 up to Region 5.

We see that in each of the ten t-SNE maps, there appears to be a small cluster of galaxies that are separated from the main, large cluster of galaxies. By coloring the t-SNE maps using various physical attributes of the galaxies, we have identified the galaxies in these small clusters to be galaxies with high amounts of dust emission. Even within these small clusters, we see that galaxies are arranged roughly sequentially by region.

To more clearly illustrate this sequential ordering, we also plotted hexbin plots for the same ten redshift ranges. Each bin is colored by the modal regime, which is the regime that is most frequently represented in that bin. In other words, the mode was the function used to reduce all of the values in a bin to a single value. The ten hexbin plots, displayed in Figure 6, clearly show the sequential ordering of the regimes.
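The reduction to a modal value can be sketched as a small function. The name `modal_value` and the tie-breaking rule (the smallest label wins) are assumptions made for this example.

```python
import numpy as np

def modal_value(labels):
    """Return the most frequent label; ties go to the smallest label."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return values[np.argmax(counts)]

# Region labels (1 to 5) of the galaxies that fall into one hypothetical bin.
labels_in_bin = [3, 3, 1, 5, 3]

print(modal_value(labels_in_bin))  # 3
```

With Matplotlib, a function of this kind can be supplied as the `reduce_C_function` argument of `hexbin`, with the region labels passed as `C`, so that each hexagon is colored by its modal regime.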

## 7. Conclusion and Outlook

The two-dimensional t-SNE maps and the hexbin plots cluster together galaxies of the same regime, meaning that pre-main sequence galaxies are most similar to other pre-main sequence galaxies, main sequence galaxies are most similar to other main sequence galaxies, and quiescent galaxies are most similar to other quiescent galaxies. We know this based on our observation that t-SNE tends to cluster together data points that are similar to each other.

We observe that the clusters of regimes are arranged in sequential order, from pre-main sequence galaxies, to the main sequence galaxies, to the quiescent galaxies, with the two clusters of “transition” galaxies also being placed correctly between the three main clusters. As demonstrated with the application of t-SNE to MNIST, we know that t-SNE places data points near other similar points even across clusters. Using this knowledge, we see that t-SNE is indicating that galaxies in one region of our maps are more similar to galaxies in directly preceding or succeeding regions than to galaxies in other regions.

We conclude that the t-SNE maps illustrate and provide evidence for our three-regime model of galaxy evolution, in which a hot, star-forming galaxy cools onto the Star-Forming Main Sequence and then becomes a quiescent galaxy when its blue, high-mass stars die out and star formation ceases.

It is clear that dimensionality reduction is a useful tool that can be used to uncover realities in datasets that are not discernible through mere observation. Specifically, in the field of astronomy, applying t-SNE or other dimensionality reduction algorithms to datasets of other celestial objects such as stars and galaxy clusters may allow us to discover new groups and classifications.

## Acknowledgements

First, I would like to thank my mentor, Professor Charles L. Steinhardt, for his guidance throughout this research project. I would also like to thank Thomas H. Clark, Andrei C. Diaconu, and the faculty at the Cosmic Dawn Center for their support. This project would not have been possible without the generous funding contributed by Mr. and Mrs. Hassenzahl.

## References

1. Armstrong D. P., McCarthy M. A., 2007, Avian Conservation and Ecology, 2
2. Astrobites 2017, Bulges Are Red, Disks Are Blue…, https://aasnova.org/2017/01/31/bulges-are-red-disks-are-blue/
3. Beohar D., Rasool A., 2021, 2021 International Conference on Emerging Smart Computing and Informatics (ESCI)
4. Brammer G. B., et al., 2011, The Astrophysical Journal, 739, 24
5. Bridge J., 2016, The changing star formation rate of the universe, https://astrobites.org/2016/11/11/the-changing-star-formation-rate-of-the-universe/
6. Cevher V., Hegde C., Duarte M. F., Baraniuk R. G., 2009, 10.21236/ada520187
7. Chetana V. L., Kolisetty S. S., Amogh K., 2020, Recent Advances in Computer Based Systems, Processes and Applications, p. 3–14
8. Devassy B. M., George S., Nussbaum P., 2020, Journal of Imaging, 6, 29
9. Fernandez R. L., et al., 2016, Monthly Notices of the Royal Astronomical Society, 458, 184–199
10. Forsyth D., 2019, Applied Machine Learning, p. 93–115
11. Ghosh B., Durrer R., Schäfer B. M., 2021, Monthly Notices of the Royal Astronomical Society, 505, 2594–2609
12. Gomez-Guijarro C., et al., 2019, The Astrophysical Journal, 886, 88
13. Khoury M., 2018, https://marckhoury.github.io/blog/counterintuitive-properties-of-high-dimensional-space
14. Labbe I., et al., 2005, The Astrophysical Journal, 624
15. Laigle C., et al., 2016, The Astrophysical Journal Supplement Series, 224, 24
16. Litvinenko A., Matthies H. G., 2009, Pamm, 9, 587–588
17. Masters D., et al., 2015, The Astrophysical Journal, 813, 53
18. McInnes L., Healy J., Saul N., Großberger L., 2018, Journal of Open Source Software, 3, 861
19. Nishiyama T., Sason I., 2020, Entropy, 22, 563
20. Peng H., Pavlidis N., Eckley I., Tsalamanis I., 2018, 2018 IEEE International Conference on Big Data (Big Data)
21. Sneppen A. B., Steinhardt C. L., 2020
22. Speagle J. S., Steinhardt C. L., Capak P. L., Silverman J. D., 2014, The Astrophysical Journal Supplement Series, 214, 15
23. Srivastava M., Srivastava A. K., Garg R., 2019, SSRN Electronic Journal
24. Staff S., 2012, Star Formation Sputtering Out Across the Universe, https://www.space.com/18370-universe-star-formation-rate-decline.html
25. Steinhardt C. L., Weaver J. R., Maxfield J., Davidzon I., Faisst A. L., Masters D., Schemel M., Toft S., 2020, The Astrophysical Journal, 891, 136
26. Wang H.-L., 2008, Acta Automatica Sinica, 34, 529–534
27. Wattenberg M., Viegas F., Johnson I., 2016, How to Use t-SNE Effectively, https://distill.pub/2016/misread-tsne/
28. van der Maaten L., Hinton G., 2008, Journal of Machine Learning Research, 9