## Identifying Optimal Proteins by Their Structure Using Graph Neural Networks

###### Proteins are the molecules of life, essential for a wide range of functions from protecting the body from viruses to building the structures of all living things. Protein engineering has been used to repurpose these machines, and proteins can be optimized for new, valuable functions for technological, scientific, and medical applications. New approaches to protein engineering leverage machine learning to find the best proteins for a given application with limited experimental characterization. A machine learning model can be used to predict how well a protein performs a given task, in other words, its “fitness”. The way in which we computationally represent proteins is crucial in determining how these machine learning models predict protein properties. While most approaches use a protein’s sequence to represent proteins, we hypothesize that protein structure provides richer information for the model to learn these properties. We represent each protein of interest as a graph, or a network of amino-acid connections in the protein, and implement a graph machine learning model to predict a protein’s fitness. We show that this structure-based model has superior performance to sequence-only approaches for fitness prediction. We further extend this model to automatically find the best protein for a given task by optimizing a protein’s graph representation. We demonstrate that this novel approach is the most successful in finding the best protein from a small number of proposed proteins.

Author: Marianne Arriola
University of California, Santa Barbara
Mentors: Frances Arnold and Kadina Johnston
California Institute of Technology
Editor: Wesley Huang

## Using Machine Learning to Explore the Non-natural World

Protein engineering has been used to uncover and optimize protein sequences that are not found in the natural world but have valuable functions for technological, scientific, and medical applications [1]. Protein engineers hope to find proteins with maximal “fitness,” a measure of how well a protein performs a desired function – for example, proteins that are strongest at enacting responses for disease treatment or proteins that are the most stable in different environments. Machine learning has accelerated the lab-based search for optimal proteins by computationally predicting protein fitness [1].

To predict fitness, one represents the protein in a data structure understood by a model so that it extracts as much useful information about the protein as possible. Informative protein representations are essential; they determine how the model learns from a protein’s characteristics to predict its fitness. The most common approaches represent proteins as sequences of amino acids, the basic building blocks of proteins [1]. However, protein structure is crucial for function and therefore may better represent proteins by encoding how a protein twists, folds, and bends to determine its function. Recent work demonstrates that graphs, a data structure used to describe relationships between entities, efficiently represent protein structure [3]. A protein can be instantiated in a graph through nodes (amino acids) and connecting edges (distances between amino acids) (Figure 1). Previous work has found that these representations provide higher accuracy in predicting structure-function relationships than sequence-based representations [3]. Though graph representations have been used to classify protein function, to the best of our knowledge they have not yet been used to predict protein fitness. Therefore, we propose graph-based approaches to represent and evaluate proteins.

## Research objectives

We test whether structure provides more information than sequence for fitness prediction

We hypothesize that a structure-based approach will result in higher predictive accuracy than sequence-based approaches because structure directly impacts protein function and therefore may provide more useful information to predict fitness.We implemented a graph convolutional network (GCN), a machine learning model that learns on graph-structured data, to predict protein fitness using structural information which would enable us to identify high-functioning proteins. We then measured the advantage that a structure-based prediction model has over a sequence-based one by comparing their predictive performances. Moreover, this approach can validate or potentially improve upon computational fitness prediction in the Arnold Lab for optimal protein identification.

We explore whether fitness prediction models can predict activity on different substrates

Protein fitness is dependent on the identity of the substrate, or protein-activating molecule, that the protein performs activity on. We propose that activity on one substrate cannot predict the activity on other substrates because a machine-learning model is tuned and trained based on the patterns in the data it is given (in this case, activity on a particular substrate). The model can have less predictive accuracy when given data with different trends (activity on alternate substrates).

Substrate structure is also crucial to protein activity: protein activity can vary widely in response to subtle changes in substrate structure. When nature has evolved a protein for a specific substrate, that activity is referred to as natural activity. Activity on substrates with fundamentally different structures is termed non-natural activity, and, importantly, it is an activity that nature has not evolved for and must be engineered. We explored model performance of predicting non-natural protein activity compared to natural activity. A model that can accurately predict non-natural activity is valuable because protein engineers typically pursue activities which nature has not yet optimized.

We attempt to automate optimal protein identification

We define a novel method to efficiently search through spaces of large numbers of variants using a protein’s graph representation. We believe that this approach could enable automated discovery of optimal proteins for a given function without having to explore every possible variant, which is computationally expensive. The model should deduce which structural patterns result in higher fitness to find the optimal protein.

## Methods

How do we test our models?

We tested our approaches to find the optimal protein amongst all ~6000 single variants ⎼ variations of a protein with a mutation at a single amino acid position in the sequence ⎼ of amiE, a gene that encodes an enzyme from P. aeruginosa [4]. Wrenbeck et. al quantified the fitnesses of these variants (cell growth, in this case) when performing activity on three substrates: acetamide (ACT), propionamide (PR), and isobutyramide (IB). Activity on ACT and PR is considered more natural, as they have structures for which amiE has been naturally evolved, than comparatively unnatural activity on IB, which has a fundamentally different structure [4].

How can we encode protein structure into a graph?

We used Triad, a physics-based protein modeling software, to predict protein structure from sequence. We then used the predicted structure to generate its graph representation that describes amino-acid contacts in the protein. We applied a language model, a model architecture that typically learns from text (in this case, protein sequence) [3], to extract amino acid-level features used to describe the nodes of the graph representation. Figure 2 demonstrates the construction of the protein’s graph representation, where structure and sequence information are used in tandem.

Can structure-based representations improve fitness predictions?

We trained a GCN to learn and subsequently predict protein fitness from its graph representation, as illustrated in Figure 3. This model takes in a graph as an input that represents some protein then passes it to a series of convolutions. These convolutions uncover potentially valuable hidden information about the protein by increasing the complexity of the graph representation via message passing, a framework in which node information is passed to connecting nodes at each step. In this context, sequence information about each amino acid is used to update the information describing other amino acids. The GCN produces a new, higher-level graph representation where each node not only contains information about itself, but also about its neighbors. This is followed by a pooling layer and an activation function that reduce the dimensionality of the graph to a single fitness score. The model ‘learns’ to output an accurate fitness score by learning which node connections to assign more priority to in the convolutions that result in more accurate prediction.

How can one efficiently navigate a protein fitness landscape using structure?

We defined a method to efficiently find the optimal protein rather than predicting the fitnesses of all candidates, which is computationally expensive. To do so, we only predict the fitness of a smaller subset of the variants: we optimize the graph structure to retrieve the one that has the highest predicted fitness. We modified a protein’s graph representation to find a new one in the “landscape” of all potential protein variants that is likely to increase predicted fitness until the protein with the best fitness is found (Figure 4) [5].

## Results

Structure-informed representations improve fitness predictions

Our hypothesis was confirmed by comparing the performance of structure-based and sequence-based fitness prediction models. Performance here is measured by the correlation between true and predicted fitnesses: high correlation (above 0.5) means that the model can accurately identify high-functioning and low-functioning variants. Representations that utilize structure yield higher performance than those that solely rely on sequence (Figure 5). This demonstrates that structure-informed protein fitness prediction models are more accurate than the baseline sequence-based approach. This confirms that protein structure is more informative than sequence.

Substrate identity results in independent fitness predictions

We show that enzyme activity is substrate-specific as the fitness prediction model performs poorly at predicting the activity on different substrates other than the one that is used for model training. As Figure 6 shows, the model that performs best at predicting the activity on a given substrate is the model that is trained on activity on that substrate. As fitness predictions on one substrate cannot be transferred to another, enzyme activity must be substrate-specific.

Furthermore, for the model that performs best at predicting activity on a given substrate, our approach is best at predicting non-natural activity on IB rather than natural activity on ACT and PR. Thus, our results imply that our model can best predict non-natural activity. We suggest that this is due to the fact that IB has a wider range of fitness values across variants compared to activity on ACT and PR; a wider range of fitness values means that the model can learn more structure-fitness patterns whereas a small range limits what the model can learn.

Graph optimization can identify optimal proteins in small fitness landscapes

We designed a method for finding proteins of optimal fitness by optimizing and repeatedly updating the graph representation of the protein. We found that in a small landscape of 30 variants, the model could successfully identify the variant with the optimal fitness. Figure 7 shows the updated representation’s similarity to the optimal variant throughout training: ideally, the model would produce a variant that is identical to the optimal variant, with similarity increasing as training round increases. As shown in Figure 7 (left) the model navigates to the optimal protein at around the 35th training round, where the similarity oscillates between the variant the model deems to have ideal fitness and another variant in the landscape. This pattern signifies that the model has already found a protein that it deems to be optimal and can no longer navigate the landscape to find new variants.

However, in a larger landscape of 600 variants, the model finds a protein with higher fitness, but not the variant with 100% similarity to the optimal, as shown in Figure 7 (right). This is likely because the model finds a protein with maximum fitness relative to proteins nearby in sequence space rather than relative to the entire landscape. Figure 8 shows the distribution of fitnesses across sequence space, where there are multiple local maxima in fitness scattered throughout the landscape that the model can incorrectly identify as the optimal protein.

## Discussion

Structure-based fitness prediction improves over a sequence-based approach, making it a promising fitness prediction tool that is more accurate for finding optimal proteins for a given function. However, there are limitations: it would be computationally expensive to compute the structure for every variant rather than simply using sequence for fitness prediction depending on the number of variants. The user must assess the number of variants that will be used as it may be of less interest to use a graph-based approach for a larger number of variants.

We found that the structure-based model performs considerably better at predicting non-natural activity than natural activity. This implies that the model can be utilized in predicting the fitnesses of potentially valuable non-natural proteins that can be created via protein engineering.

This work has only been tested on a single-mutant fitness landscape, where there are only small changes in structure. However, it may be of interest to search for the best protein through variants with multiple mutations, where there are larger changes in structure. If there are larger changes in structure, the model may have more success at differentiating between structures in producing fitness predictions. Unfortunately, landscapes with multiple simultaneous mutations are largely populated by variants that exhibit little to no activity (extremely low fitness), making navigation a more difficult task.

Fitness landscape navigation is of interest as it would allow one to automatically find optimal proteins for a given function without lab experimentation and without calculating every candidate’s fitness. This is a novel approach in machine learning for fitness prediction because most related approaches only focus on predicting a given protein’s fitness. However, navigating diverse landscapes with different topologies may pose challenges. Our landscape navigation method aims to incrementally find proteins with higher fitness until it finds the optimal protein (in other words, the highest peak in the landscape) as shown in Figure 9. The navigation model proposed here incrementally finds better proteins until it has found a peak which it assumes is the optimal protein, but there is no way of guaranteeing that it is the highest peak (the true optimal protein) without exploring every point in the landscape. This makes the model successful in finding the optimal protein in a landscape with only one peak (Figure 9, left). However, if there are multiple peaks, the model may find a peak that is not the highest and incorrectly assume that it is the optimal protein (Figure 9, right).

Our fitness landscape navigation model can successfully find the optimal protein in a small landscape, but it was unsuccessful in a larger landscape likely because it is more complex and has multiple peaks. Future work to make this navigation robust to diverse landscapes will give researchers freedom to find any optimal protein, no matter how smooth or rugged the landscape is.

## Conclusion

In summary, we first found that protein structure is more informative than its sequence for protein fitness prediction. We also found that non-natural activity is more predictable than natural activity. Third, we showed that a fitness prediction model trained on a specific substrate does best at predicting activity on that substrate. Finally, the fitness landscape navigation model proposed here to automatically find the best protein for a given task succeeds in smaller landscapes but struggles in larger, complex landscapes.

It is worth exploring this approach further to enable successful navigation of larger, diverse fitness landscapes: for example, landscapes of variants that have multiple mutations. Navigating through variants with multiple simultaneous mutations is likely to result in barren landscapes with few peaks; it may be of interest to define a method to navigate through such landscapes. A potential approach would begin several navigation searches at different points in sequence space, with the hope that the model will find different peaks.

This work is a promising step in informed protein engineering, which will enable better protein identification needed for impactful technological, scientific, and medical applications.

Exploring protein fitness landscapes by directed evolution

Machine-learning guided directed evolution for protein engineering

Structure-based protein function prediction using graph convolutional networks

## References

1. Yang, K. K., Wu, Z. & Arnold, F. H. (2019) Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694. DOI: 10.1038/s41592-019-0496-6
2. Romero, P., Arnold, F. (2009) Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 10, 866–876 DOI: 10.1038/nrm2805
3. Gligorijević, V., Renfrew, P.D., Kosciolek, T. et al. (2021) Structure-based protein function prediction using graph convolutional networks. Nat Commun 12, 3168. DOI: 10.1038/s41467-021-23303-9
4. Wrenbeck, E., Azouz, L. & Whitehead, T. (2017) Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nat Comm 8, 15695. DOI: 10.1038/ncomms15695
5. Li, J., Tianyun Z., Hao T., Shengmin J., Makan F., and Reza Z.  (2020) SGCN: A Graph Sparsifier Based on Graph Convolutional Networks. Advances in Knowledge Discovery and Data Mining 12084: 275.

## Acknowledgments

I would like to thank my co-mentor, Kadina, for being supportive of my ideas, patient with my work, and for giving me valuable insights on my project; I would also like to thank my mentor, Dr. Frances Arnold, for giving me the opportunity to work outside of my discipline in chemical engineering research, and Dr. Sabine Brinkmann-Chen for taking the time to give helpful feedback on my reports and presentations. I thank my principal investigator from the DYNAMO Lab at UCSB, Dr. Ambuj Singh, for inspiring me to pursue graph machine learning. Finally, I send my thanks to the Resnick Sustainability Institute (RSI) and Student-Faculty Programs (SFP) at Caltech for providing me with financial support on my project.

## Interview with Professor Eric Mazumdar

Interviewer: Hannah X. Chen

### Could you start by sharing a brief overview of your research and its applications?

I work at the intersection of machine learning and economics, and the motivation is really thinking about how we use machine learning algorithms in the real world, and what happens when we actually use them. In the real world, algorithms have to interact not only with people, but potentially also with other algorithms, and they have to deal with situations that we’ve historically seen machine learning algorithms struggle with. If you think back to applications of machine learning in the real world, there’s all these famous failures. There’s self-driving cars that crash because they can’t tell the sky from a truck, or that won’t cross an intersection because they can’t figure out the intent of other agents that might be present. There’s also all types of problems that happen in e-commerce markets, in social networks, and so on. And you can really think of these issues as arising because learning algorithms are coming into contact with people and other algorithms.

A big part of my research is just trying to understand what can go wrong when we use algorithms in environments containing real people who might have their own objectives. We’re also thinking about how we would design algorithms specifically for these environments so that they can have provable guarantees or at least economically meaningful measures of performance. My main application areas are in transport systems (ride-sharing platforms are a major example of this), e-commerce platforms, social networks, the delivery of healthcare — all of which have this kind of flavor of algorithms meeting people and interacting with other algorithms.

### What do you see as the short-term or long-term directions of your work and the field as a whole? What is the end goal you would eventually like to reach?

The kind of pipe dream is this idea that we can develop algorithms that actually have a beneficial impact in society. And what I mean by that is having algorithms that don’t make things worse when they’re actually deployed in the environment — instead, they actively improve things. Maybe if it’s in transport systems, it improves congestion in cities. Maybe if it’s on e-commerce platforms, it lowers prices and makes the market more equitable. But doing this while keeping in mind issues of fairness and biases in data is really hard. But I think it’s a really important problem, because people aren’t going to stop using algorithms in the real world, and we really need to start figuring out how we do this properly, fairly, and ethically. That’s a big part of what my research is trying to do.

### Is there anything you’re really excited about in your work right now, such as a current project or new idea?

Theoretically, we were able to show that one of the most important aspects when you’re using learning in the real world is how fast you’re learning. Often, we think that learning fast is always better, but when we started applying these types of ideas in games — where you’re learning against an opponent who’s also learning — we found it might not be better. And actually, the speed at which you learn changes the equilibria you might find, but different equilibria have different properties. Some might be more socially optimal, some might be better for individuals, and so on. So all of a sudden, we now have this way of achieving different solutions by changing how fast we learn.

Something I’m really excited about is that this is a phenomenon we’re seeing in real data from pricing on e-commerce platforms, for example. We see that people change how fast they update the prices relative to their competitors to get an advantage. Sometimes people update their prices very very slowly, and then they’re actually better off and are able to get more of the market. In other cases, they update much more quickly and can be better off as well. So there’s this interesting degree of freedom, which I don’t think has been explored much. I’m very interested in exploring the dynamics of learning algorithms in these game theoretic settings, which I think will be a really exciting area in the next couple of years.

### What are the biggest challenges that come with performing this research? What are the things you have to do to get your work moving forward?

A lot of my work is more on the theory side, mostly trying to develop algorithms that satisfy some theoretical guarantees and seeing that they work in practice. There’s two main challenges: on the data side, a lot of companies and entities that use algorithms in the real world don’t want to share their data because they don’t want to show that they’re doing something that isn’t working, or they’re scared of revealing some sort of inherent bias in their data. On the theoretical side, you have to bring in ideas from a lot of different fields to try and understand these problems. You need ideas from game theory and economics to deal with these issues of multiple agents all interacting, but then you need ideas from control and dynamical systems to think about the dynamics that are arising in these situations. And you need to bring in ideas from computer science to deal with the algorithmic parts. It’s really a research agenda that’s at this intersection of three different domains, so there are natural challenges regarding how they interact and whether you can make these things play nicely. It’s a really cross-disciplinary research agenda, and that comes with challenges, but also a lot of opportunities.

### What has it been like working in both the CMS and economics departments at Caltech?

I only recently started, but it’s been a very satisfying experience because from what I’ve seen, economics and the computing sciences are very tightly integrated at Caltech. The economists are familiar with a lot of the tools and ideas from computer science and engineering, and the people in engineering also have some connections with economics, so it’s been really interesting to work at this interaction and play both sides. In my view, Caltech is extremely well positioned in this new research area that’s emerging at the intersection of machine learning and economics. We have a very strong mathematical microeconomics core, along with very strong experimental economists that understand decision making, whose ideas we can then embed into our algorithms. On the other hand, it has a very strong computer science and mathematical sciences department. So it’s been a really fun, unique opportunity, and part of that is only possible because of the small size of Caltech, which makes it easy to work between divisions and departments.

### What drew you to this area to begin with? How did you get started in either machine learning or economics?

The way I got into computer science, machine learning, and economics is a bit weird. I basically started by getting my hands on data. We were looking at data about how people park their cars in Seattle, and there’s this idea that a large part of the congestion in Seattle is caused by people just looking for parking. So the question was whether we could come up with a model for this and see if this is true. Something very natural that emerged from this problem is to model parking as a game between people: those trying to drive on the city streets and those looking for parking spots. We started doing that, and that got me thinking a little bit more about how we can even use learning algorithms in game theoretic settings. From there, I kind of moved more and more towards machine learning and game theory, and then economics more broadly. So it was not an immediate process — I didn’t go into grad school knowing that this was what I wanted to do, but it was a very natural thing because this research area allowed me to draw on ideas from game theory, which I thought was fascinating, but also apply tools from machine learning, dynamical systems, and control, all in the context of interesting, real-world problems.

### What were your mentors like as you were working in multiple fields? What has your experience been as a mentor to students of your own?

What may be surprising is that in my undergrad, I was actually doing computational biology. I had a set of really good mentors who allowed me to explore my interests in that field and help me discover that the thing that excited me most in computational biology was actually mathematical modeling and coming up with dynamic models of how, for example, drugs propagate through biological systems. When I got to Berkeley for grad school, I again had mentors who could really get me thinking about big open questions and the emerging themes and questions that were arising as machine learning and algorithms are deployed in the real world. So I’ve been extremely lucky throughout.

Here at Caltech, there’s two main parts of mentorship for me. The first is giving people the freedom to be creative and come up with their own ideas while giving them guidance through high-level research goals and things like this. The second thing is mental health. I’ve had a lot of close friends struggle with mental health in grad school and undergrad, so for me, a big part of mentorship is making sure the person feels very supported on the day-to-day and has something to fall back on. A lot of these environments, grad school and even undergrad at such an academically rigorous place, can be very high-pressure, so you need to have strong support networks for students.

### What are some of your favorite parts about being a researcher and professor?

The best part of doing research is being able to talk to people and all of the random conversations you have in hallways or in front of white boards. It’s also just interacting with students and people with such different backgrounds and interests. That’s kind of the standard answer, but it’s really the most interesting part for me, especially at Caltech where you can meet so many people who do so many different things. You can go from a conversation about fluid mechanics to game theory and auctions in the same hour. It’s all very exciting.

### Do you have any advice for students who want to get into research, especially within your field?

As I mentioned, I was in bioengineering when I started undergrad, and I ended up in EECS (electrical engineering and computer science); now, I’ve moved towards economics. I think being multidisciplinary and having varied interests is a good thing. Interesting things happen at the intersection of fields, so don’t be scared of trying to make connections between fields.

Getting involved at Caltech, I think, is really a question of persistence. If professors aren’t answering (and I’m definitely guilty of this, too), it’s not because they don’t want to — it’s just that they get too many emails. We want to work with students at Caltech since you’re all so good, so it’s very important to not be afraid to just send the emails and knock on doors. I was definitely worried when I sent my first few emails looking for positions, but in the end it’s just biting the bullet and doing it.

## Interview with Professor Frances Arnold

Interviewer: Maggie Sui

### Biography

Professor Frances Arnold is the Linus Pauling Professor of Chemical Engineering, Bioengineering and Biochemistry at Caltech and a 2018 Nobel Laureate in Chemistry. Her research focuses on using directed evolution to create new enzyme function. Prof. Arnold received a B.S. in Mechanical and Aerospace Engineering from Princeton University and received a Ph.D. in Chemical Engineering from the University of Berkeley.

### Trivia

Favorite book:  Too many to choose from.  I am now enjoying Ai Wei Wei’s memoir

Favorite Place:  Planet Earth

Favorite Food:  oysters

Favorite Protein molecule:  cytochrome P450 BM3

### What type of research does your lab focus on?

We breed molecules like you can breed cats and dogs, selecting for the traits that are useful to us in a process that mimics evolution by artificial selection.  The ‘directed evolution’ methods we pioneered to engineer better enzymes are now used all over the world for everything from laundry detergents, cosmetics, treating disease to making jet fuel from renewable resources.  I envision a world where nature’s clean and efficient chemistry replaces dirty human manufacturing processes.  Wouldn’t it be wonderful to make our pharmaceuticals, cosmetics, fuels and chemicals like we make beer? Wouldn’t it be wonderful to have microbes eat our plastic wastes and convert them to valuable new materials?  That’s all possible with enzymes.  But someone has to make them do what humans want rather than what they do in nature. That’s what we do.

### What aspects of research do you most enjoy?

I love discovering new things that enzymes can do.  Sharing the joy of discovery with others is almost as much fun. And since my students are always discovering new things, it’s a blast to go to work.

### What influenced you to pursue science? Did you have any role models growing up?

My father was a nuclear physicist and also did a lot of engineering design. I wanted to be like him when I grew up, but I didn’t love nuclear physics. I tried a few things before I found protein engineering, at the beginning of the DNA revolution.

### What was the origin for your directed evolution idea? When did you first think of it?

When no existing design approaches worked to make better enzymes, I was getting a bit desperate. I decided to try lots of mutations, easily done randomly using error-prone polymerase chain reaction (PCR) and testing by screening in plates. I was thrilled when we found beneficial mutations in surprising places, where no one would have predicted. We accumulated those mutations in sequential rounds, and sometimes recombined them, just like breeding cats or dogs over multiple generations. That worked beautifully, and evolution became the only reliable method for generating better enzymes.  It was an instant ‘hit’ with people in industry, but took much longer to gain the respect of scientists, who always want to understand why. To understand why, however, it’s much nicer to start with the correct answer and work backwards. Directed evolution gave the correct answer.

### How did you become the Nobel-prize winning scientist you are today?

I don’t really know for sure, but it probably has something to do with Caltech. I came here naïve and with limited training, but a good idea.  I was able to implement my ideas here, and formulate even better ones with the help of critical colleagues and brilliant students and postdocs. I never stopped to worry whether it would be possible to make better enzymes or whether I would be able to do it. Nor did my students.

### What is your work in Washington like?

Nobel Laureates are always warned not to get involved in things we know nothing about.  I threw that advice to the wind when I took on co-chairing the President’s Council of Advisors on Science and Technology (PCAST) in January, 2021.  I find it fascinating.  And I have already learned that science is easy compared to dealing with…people.  People are really complicated. We can have all the right science, but if people don’t trust it or won’t use it for other reasons, it goes nowhere. In this job, I am also seeing many fascinating new science problems and solutions.

I see this work (which is unpaid and takes a lot of time, by the way) as a way to pay back, to make sure that future scientists will have the same wonderful opportunities that I enjoyed.

### What are some of the challenges you’ve recently faced? How do you overcome them?

We all face challenges: loss of loved ones, illness, failure.  If you have lived as long as I have, you will experience all of those, perhaps multiple times.  I find that the best way to face life is to be grateful for what I have, which is so much. I try not to dwell on what I have lost, nor do I feel sorry for myself. I am grateful to still be healthy and able to do something useful.

### What are your interests outside of science?

I love music, gardening, hiking, meals and conversation with friends. I used to love travel, but I haven’t been many places lately…

### What advice would you give to Caltech students and other young students interested in research?

Do it because you love it.  All else will follow.

## Evolution of Major Element Chemistry in Protoplanetary Disks and Its Connection to Planetesimal Growth

###### Predicting the habitability of planets is closely tied to the amount of water, but understanding how water was delivered to the Earth is not well understood. Currently, we know there is a connection between the bulk composition of Earth, essentially the Earth’s major composing elements, and the total water content. A key component in explaining this connection is chondrites, which are early unaltered meteorites from 4.5 billion years ago. Chondrites can help explain this connection by giving us insight into the thermochemical environment during the early stages of planetary formation. By comparing current chondrite samples to the results of model, we can better understand the mechanisms that allowed terrestrial to form in the inner disk and how their bulk compositions evolved.

Author: Rohan Iyer
California Institute of Technology
Mentor: Yoshinori Miyazaki
Department of Computing & Mathematical Sciences, Caltech
Editor: Agnim Agarwal

## Abstract

Understanding the processes that could produce bulk compositions similar to Earth’s is crucial in constraining how and where planets formed and grew in the early Solar System. A key component in understand these processes arise through analyze the evolution of chondrites. Chondrites are early unaltered meteorites from 4.5 billion years ago and they are likely candidates for the building blocks of Earth. In particular, the chemical composition of chondrites has a puzzling characteristic as chondrites exhibit surprising depletions in refractory elements including Al, Ca, and Mg. However, we still lack an astrophysical model to bridge the gap between the early stage of protoplanetary disk evolution, dense solar gas and dust rotating around a young star, and the formation of planetesimals, growing planet cores. The formation of planetesimals within the evolution of the protoplanetary disk still remains unclear; To this end, we built a self-consistent model solving for the disk evolution from the dust condensation to the planetesimal growth stage. The key mechanisms for planetesimal formation considered in the study are pebble accretion, streaming instability, and the transition of magnetorotational instability (MRI) modeled through $\alpha$. In our model, we consider the effects of different $\alpha$ values due to MRI as well as dust to gas ratio thresholds and accretion timescales that onset streaming instability. Our results demonstrate that the turbulence from an $\alpha$ transition from MRI to $10^{-4}$ in conjunction with streaming instability and pebble accretion creates lower pressure regions in which dust and pebbles accumulate within the terrestrial region. However, our model doesn’t account for two chemically distinct Mg-rich reservoirs which are necessary for explaining why Earth and enstatite chondrites share similar isotopic composition but different bulk compositions. We anticipate that our model will be used to constrain the viable parameter space for the formation of planetesimals as well as be a motivation to explore the mechanisms that produce a greater depletion in refractory elements.

## 1. Introduction

Bridging the gap between the chemical and astrophysical mechanisms is still largely debated. A key aspect to understanding this disconnect is by looking at chondrites. Chondrites provide a connection that can be used to constrain the building blocks of Earth and the thermochemical environment during early protoplanetary disk formation. One of the challenges in understanding the bulk composition is the lack of understanding in how the elemental compositions of chondrites evolve. Early planetary formation is supported by condensation theory which suggests that planets condensed from gas in the early nebula. As the solar gas dissipated over time, the protoplanetary disk cooled which suggests that volatile elements are expected to be depleted. However, chondrite samples show a surprising depletion in refractory elements like Ca, Mg, and Al (Wasson and Kallemeyn, 1988). One consideration for the depletion of refractory materials is that they were delivered to the inner disk because of their condensation temperatures. Refractory materials have higher condensation temperatures than volatiles, ~1500-1800 K for Al and Ca rich solids. Thus, it is possible that the refractory solids were able to diffuse inwards and sustain the higher temperatures of the inner-disk, resulting in the larger depletion of refractory material than volatile material.

When considering the astrophysical dynamics, the mechanisms that create terrestrial planets are still controversial. Accretion processes often consider two main objects: pebbles and planetesimals. Pebbles are millimeter to centimeter sized bodies, while planetesimals are kilometer sized objects. A promising mechanism in which the orbiting solids gravitationally collapse onto the growing planetesimal is streaming instability, which occurs when the dust to gas ratio or pebble to gas ratio surpasses a threshold. Once planetesimals form, they have been theorized to grow efficiently through pebble accretion. A key aspect of pebbles is in the way they interact with the gas flow. Pebbles are marginally coupled to the gas flow which facilitates planetesimals accreting material from the entire gravitational reach of the planet. The headwind experienced by the dust and pebbles particles allows them to lose angular momentum and accrete efficiently. This allows planetesimals to dissipate their kinetic energy and become gravitational bound to nearby large bodies, yielding planetesimal growth rates much higher than previous accretion rates considering planetesimal accretion (Johansen & Lambrechts 2012).

These mechanisms were suggested for gas giants like Uranus and Neptune, and so our model considers how streaming instability and pebble accretion characterize the inner-disk evolution of terrestrial planets. In particular, we consider the effects of streaming instability, pebble accretion, and MRI simultaneously which hasn’t been examined in previous studies. As such, we are looking to test different thresholds to onset streaming instability and see if they yield terrestrial planets consistent with our observations.

In our model, we consider both condensation theory and pebble accretion to investigate how and where planetesimals form as well as the loss of minerals in the inner region. The thermochemical evolution of the inner disk has been discussed in previous studies (Pignatale et al., 2018; Miyazaki and Korenaga, 2021), but previous disk models haven’t considered the major element chemistry in conjunction with astrophysical phenomena like MRI and pebble accretion to track the evolution from dust to planetesimal. Our model records the specific major element composition (Al, Ca, and Mg) over the entire protoplanetary disk evolution ($\sim$ 1 Myr) from dust to planetesimal while incorporating these astrophysical principles. This allows us to constrain the astrophysical and thermochemical parameters and compare the final elemental compositions of our model to observations in chondrite samples. A key component of this model is implementing $\alpha$ values (measure of turbulent viscosity) considering the changes in turbulence within the inner disk. A key mechanism which onsets changes in turbulence is MRI. MRI works to create macroscopic turbulence and transport angular momentum, and the MRI transition has been suggested as a way to create a seed for terrestrial planets (Morbidelli et al., 2020). Therefore, an $\alpha$ transition due to the onset of MRI could be a possible mechanism for planetesimal formation.

To understand the conditions that enable streaming instability, we first considered the early stage of disk evolution: transport of gas, dust, and pebbles, before planetesimals are formed. The results showed a sizable dust depletion inside 3 AU where much of the dust was being converted to pebbles but not enough of an enrichment in pebbles to surpass the streaming instability threshold. We implement an $\alpha$ transition due to the onset of MRI that directly changes based on the mid-disk temperature so higher temperatures result in stronger turbulence. Our results demonstrate that a significant $\alpha$ value change in the terrestrial region creates low pressure regions in which dust accumulates and eventually grows into planetesimals.

## 2. Model

Our model solves for radial motion of gas, dust, pebbles and planetesimals within the protoplanetary disk. The motion of gas is described by the diffusion equation:

For solid materials, we solve for the dynamics of the three elements (Al, Ca, and Mg) using the advection-diffusion equation. The first term on the right hand side describes advection, and the second term is dissipation based on the concentration gradient, where $\Sigma_i$ denotes the surface density of species i, t is time, r is the distance from the Sun, $v_i$ is the advection velocity, $\nu$ is viscosity, and $\Sigma$ is the total surface density.

For each element, we consider the evolution from dust to pebble to planetesimal while solving for each stage’s dynamics separately according to Eq. 1. Pebbles are mm-cm sized bodies formed through the collisions of the dust particles. Relative to dust particles, pebbles travel at higher advection velocities and accrete onto growing cores quickly because of their interaction with the gas flow. Pebbles are especially important because of their ability to deliver minerals to the inner regions of the disk as a result of their larger size and drift velocities. The formation of pebbles is described by the dust to pebble formation timescale, $\tau_{d\rightarrow p}$, and the Stokes’ number, $St_p$. Additionally, the evolution from pebble to planetesimal is similarly solved for but characterized by the streaming instability condition, the solid to gas ratio. The streaming instability condition is especially significant because it’s the main mechanism in the formation of planetesimals. In our study, we consider various timescales and pebble grain sizes in order to determine the most likely condition for the onset of streaming instability. We vary pebble to planetesimal timescales and streaming instability conditions in order to investigate how these parameters affect regions of elemental depletion as well as planetesimal formation and composition.

### 2.1. Disk Model

The model we built tracks the density and composition of solid particles of different sizes (dust, pebble, and planetesimal), as well as the temporal and spatial evolution of gas, temperature, and pressure. Our model solves for the evolution of the disk, spanning from 0.05 to 30 AU with 600 spatial intervals, where the dynamics, thermal structure and composition are self-consistently solved at a time step of 0.1 years. Initial chemical characteristics are defined by the elemental ratios for Al, Ca, and Mg which are .0391, .0372, .9236, respectively. Along with this, we define astrophysical constraints for the planetesimal as having a diameter of 200 km and a material density of 3000 kg/m$^3$.

### 2.2. Thermal Structure

The thermal structure of the disk is solved using energy balance between blackbody radiation and viscous dissipation:

where $\sigma_b$ is the Stefan-Boltzmann constant, $T_e$ is the effective temperature, and $\Omega_k$ is the local Keplerian angular velocity. The mid-disk temperature, $T_{mid}$, is then calculated using the effective temperature:

where $\tau = \frac{1}{2}\kappa \Sigma_d$ and $\kappa$ is opacity of the dust.

### 2.3. Dynamics

For each species i in the gas and dust components, its surface density
$\Sigma_i$ is solved using a 1D radial disk evolution model.
The advection velocity $v_i$ is calculated separately for the gas and dust
components. The gas component is calculated using,

and for the dust component,

where $St$ is the normalized stopping time, $\rho_{g,0}$ is the gas density at the disk midplane, and $P$ is pressure. The pressure at the mid plane is a function of the surface density,

where $c_s$ is the sound velocity. The calculated Stokes Number is given by:

where Q* is the specific energy of fragmentation in which we adopt a value of $1m^2/s^2$, and $\alpha$ is a measure of the turbulence. We consider both constant Stokes numbers and Stokes number distributions calculated from Eq. 7 as well various $\alpha$ distributions in our model.

### 2.4. Alpha

While cause and magnitude of turbulence in the protoplanetary disk is still an active field of research, in the early stage of a protoplanetary disk, both magneto-rotational instability and pure hydrodynamic turbulence are considered as sources of turbulent viscosity (Bai and Stone 2013; Nelson et al. 2013; Stoll and Kley 2014). MRI is caused by the effective magnetic field from moving electrons in fluids and the resulting Lorentz force from that magnetic field. These forces have a destabilizing effect on the angular velocity of the fluid. Many previous mechanisms for fluid turbulence have been explored including shear flow turbulence, but now the transition of $\alpha$ due to MRI has been suggested as a way to create a seed for terrestrial planets. The macroscopic turbulence is characterized by $\alpha$ in Eq. 4. Previous studies either consider a constant $\alpha$ value or a grid of $\alpha$ values including $10^{-4}$, $10^{-3}$, $10^{-2}$. We simulate the change in $\alpha$ due to MRI with a temperature dependent $\alpha$ distribution in which $\alpha$ directly changes based on the mid-disk temperature. Therefore, higher temperatures result in stronger turbulence, especially in the inner disk. This allows the turbulence to evolve as temperature changes over time and better reflect the turbulent viscosity of the inner disk.

### 2.5. Streaming Instability & Pebble Accretion

For each element, the growth from pebble to planetesimal is modelled using pebble accretion. The mass accretion rate is defined as:

where $R_{acc}$ is the accretion radius, and $\Sigma_p$ and $\rho_p$ are the pebble column density and mid-plane density, respectively. The accretion rate of pebbles is closely tied to the friction time of pebbles which is described by:

Since the focus of our model is the early stage of growth, we adopt the follow equation for the accretion radius ($R_{acc}$) using the Bondi Limit:

(Johansen & Lambrecths 2017). For simplicity, we assume that planetesimals will be 200 km in diameter and have a material density of 3000 kg/m$^3$. The formation of planetesimals is triggered by the streaming instability condition. Pebbles are marginally coupled to the gas so radial drift allows pebbles collide and grow larger. As a result, they can gravitationally collapse onto the core if the dust to gas or pebble to gas ratio becomes too heavy. The theoretical value for the streaming instability condition has been suggested to be characterized by a dust to gas ratio > 1. In our model, we consider values of $10^{-1}$, $5 *10^{-2}$, $10^{-2}$ as plausible values for the streaming instability condition. Note that we label the the change in $\alpha$ from $10^{-4}$ to constant $\alpha$ and $10^{-3}$ to constant $\alpha$, MRI 4 and MRI 3 respectively, and we will use these naming conventions throughout the figures.

## 3. Results

Our model tracks the evolution of the major element chemistry through dust to planetesimal. Throughout the evolution of the disk, the temperature profile of the disk cools as the nebular gas cools and dissipates over time. Gas and dust from the outer disk through advection and diffusion travel towards the inner disk in which dust and pebble particles can accumulate and grow over time (Figure 2). This model builds upon previous models but incorporates key ideas like non-uniform turbulence due to MRI and pebble accretion. The implementation of turbulence due to MRI points to new considerations for the formation of planetesimals in the terrestrial region. The turbulence within the disk has a significant impact on the dust accumulation and the streaming instability condition.

The dust accumulation can be attributed to the pressure within the inner disk (Figure 3). For larger differences in $\alpha$, there exist more low pressure regions within the disk that dust can travel into. These low pressure regions enable dust to form into pebbles and eventually planetesimals as dust flows from high pressure regions into these low pressure pockets. Additionally, the planetesimals formed must fit within reasonable constraints of terrestrial planets, particularly its mass. Terrestrial planets within the first 0.1 Myr should be on the order of $10^{24}$ kg and especially when considering the bulk composition evolution of Earth, the mass should be close to Earths. Two different $\alpha$-values are tested for the outer region ($10^{-3}$ and $10^{-4}$), and its effect on the formation of planetesimals is illustrated in Figure 4.

Another parameter which impacts the growth of the planetesimals is the Stokes number. In this figure, we consider $St_p$ of 5, 10, and 20 times the $St_d$. Terrestrial planets exist within a specific mass range, typically < $10^{24}$ kg. Comparing the Stokes number further helps us constrain the viable parameter space. Looking at Figure 5, there is a positive correlation between the Stokes Number and the planetesimals mass. In the first 100,000 years of evolution, planetesimals with masses ~ $10^{25}$ kg would not explain the evolution of Solar System as terrestrial planets have masses on the order of $10^{23}$ to $10^{24}$ kg.

Finally, we can constrain the elemental compositions and thermal structure of the inner disk. Chondrites are likely the building blocks of Earth but not one specific type of chondrite matches the Earth in terms of chemical and isotopic compositions. Earth is suggested to form from a Mg-rich reservoir, which existed just adjacent to the source region of enstatite chondrites, depleted in Al, Ca, and Mg. In order for the fractionation between these elements to occur, the temperature profile inside 1AU needs to be > 1000K which places further constraints on our parameter space. Incorporating a timescale, $\tau_{d\rightarrow p}$ < 10,000 years would make the inner disk too cold which would increase the amount of ice existing around the terrestrial region. For a timescale of 100,000 years, more dust remains to support a high temperature inside 1AU. However, we don’t see enough of a depletion in Al and Ca to support having two chemically distinct Mg rich reservoirs (Figure 6).

## 4. Conclusion

By solving advection-diffusion equations in a self-consistent manner, our model considers pebble accretion and magnetorotational instability to simulate conditions in which streaming instability can occur. Our model suggests that creating planetesimals requires a significant change in turbulence which is measured by the dimensional parameter $\alpha$ in order to create pressure pockets for dust and pebble accumulation. In our model, a transition from turbulence due to magnetorotational instability to a constant value of $10^{-4}$, MRI 4, is sufficient in creating planetesimals in the terrestrial region. Additionally, the total mass of the planetesimals heavily depends on the Stokes Number in which a pebble stokes number equal to 5 times the dust’s Stokes Numbers yielded feasible masses for possible terrestrial planets. Additionally, the timescale $\tau_{d\rightarrow p}$ affects the streaming instability condition and temperature profile. For $\tau_{d\rightarrow p} = 10,000$ years and $\tau_{d\rightarrow p} = 100,000$ years, the model allows for a higher SI condition to be onset but also depletes dust too quickly which in turn results in heat dissipating too quickly as well. While the transition of $\alpha$ from MRI to $10^{-4}$ may trigger the streaming instability in the terrestrial region, the resulting compositions do not seem to explain the Earth nor enstatite chondrites in terms of the elemental compositions as we do not see enough depletion in Al and Ca. However, in a broader sense, our model works to connect two previously distinct areas of research, the minute chemical interactions of gas and dust within the early protoplanterary disk and N-body dynamics which are simulations once planets have already formed. While our model may not currently explain the formation of Earth, it’s possible that the parameters we considered could explain the composition of other exoplanet systems and that our model show key advancements in the evolution of the protoplanterary disk as well as in bridging the gap between cosmochemical observations and astrophysical dynamics.

## 5. Future Work

The next step for this model is to expand the viable parameter space. The parameters that were evaluated in this model help constrain viable astrophysical parameters but certain aspects like the depletion in Al and Ca are still not fully explained from our model. In addition, more work can be done to push the bounds of the streaming instability condition closer to the theoretical threshold. Inversely, finding the minimum thresholds to onset SI is an active area of research, so developments on these two fronts will help improve our model. On a grander scale, unifying our model of the early stage of evolution with the cosmochemical observations and later stages of planetary evolution like N-body simulations will give us a more complete depiction of the formation of terrestrial planets and the solar system.

## Acknowledgements

I would like to thank Yoshinori Miyazaki for his continued mentorship throughout our research as well as the Caltech SURF program for supporting this project.

## References

1. X.-N. Bai and Stone J. M. Wind-driven accretion in protoplanetary disks. i. suppression of the magnetorotational instability and launching of the magnetocentrifugal wind. The Astrophysical Journal, 769(1):76, 2013. doi: 10.1088/0004-637x/769/1/76.
2. S.J. Desch, Estrada P.R., A. Kalyaan, and J.N. Cuzzi. Formulas for radial transport in protoplanetary disks. The Astrophysical Journal, 840(2):86, 2017. doi: 10.3847/1538- 4357/aa6bfb.
3. J. Drążkowska and Y Alibert. Planetesimal formation starts at the snow line. Astronomy & Astrophysics, 608, 2017. doi: 10.1051/0004-6361/201731491.
4. Y. Miyazaki and J. Korenaga. Effects of chemistry on vertical dust motion in early protoplanetary disks. The Astrophysical Journal, 849(1):41, 2017. doi: 10.3847/1538- 4357/aa8cd1.
5. Y. Miyazaki and J. Korenaga. Dynamic evolution of major element chemistry in protoplanetary disks and its implications for earth-enstatite chondrite connection. Icarus, 361:114368, 2021. doi: 10.1016/j.icarus.2021.114368.
6. A. Morbidelli, G. Libourel, Jacobson Palme, H., S. A., and D. C Rubie. Subsolar al/si and mg/si ratios of non-carbonaceous chondrites reveal planetesimal formation during early condensation in the protoplanetary disk. Earth and Planetary Science Letters, 538:116220, 2020. doi: 10.1016/j.epsl.2020.116220.
7. F. C. Pignatale, S. Charnoz, M. Chaussidon, and E. Jacquet. Making the planetary material diversity during the early assembling of the solar system. The Astrophysical Journal, 867(2), 2018. doi: 10.3847/2041-8213/aaeb22.
8. M. H. Stoll and W Kley. Vertical shear instability in accretion disc models with radiation transport. Astronomy & Astrophysics, 572, 2014. doi: 10.1051/0004-6361/201424114.
9. J.T. Wasson and G.W. Kallemeyn. Compositions of chondrites. Series A, Mathematical and Physical Sciences, 325(1587):535–544, 1988. doi: 10.1098/rsta.1988.0066.
10. A. Youdin and A Johansen. Protoplanetary disk turbulence driven by the streaming instability: Linear evolution and numerical methods. The Astrophysical Journal, 662 (1):613–626, 2007. doi: 10.1086/516729.

## Dimensionality Reduction as Evidence of a New Regime of Galaxies

###### Currently, researchers believe that there are only two regimes of galaxies, delineated by whether they produce stars or not. However, using a novel machine learning technique to estimate the temperatures of galaxies, we find evidence of a distinct third regime of galaxy evolution. To demonstrate this, we studied roughly 650,000 galaxies in the publicly available COSMOS dataset. Moreover, we identify a group of galaxies that belong to this third regime, providing clear visual evidence of our three-regime model of galaxy evolution.

Author: Patrick Rim
California Institute of Technology
Mentors: Charles L. Steinhardt, Adam Blank
Cosmic Dawn Center, Niels Bohr Institute, California Institute of Technology
Editor: Laura Lewis

## Abstract

Dimensionality reduction is an unsupervised machine learning technique used to discover and visualize patterns in a dataset by reducing its number of dimensions. We show that by applying dimensionality reduction to a dataset of galaxies, we can provide evidence for an updated model of galaxy evolution. The currently accepted model of galaxy evolution defines two regimes of galaxies: star-forming and quiescent. However, using a novel technique called IMF fitting to estimate the temperatures of galaxies, we find that the regime of star-forming galaxies can be subdivided into two distinct regimes. To demonstrate this, we studied roughly 650,000 galaxies in the COSMOS dataset, each of which has flux measurements in twenty-seven photometric bands. First, we preprocessed the data by removing galaxies with incomplete or erroneous measurements and normalized the photometric measurements to a single photometric band. We then applied the t-SNE dimensionality reduction algorithm to the preprocessed dataset, determining appropriate values for t-SNE’s parameters. Observing the resulting plots, we found that t-SNE clustered the galaxies together by regime and that the three clusters representing the three regimes were ordered sequentially on the map. Thus, our study provides evidence of three distinct regimes of galaxy evolution.

## 1. Introduction

The universe is filled with stars, which were all formed at some point in time. Most existing stars were formed in the early stages of the universe, with star formation rates peaking about ten billion years ago [5]. In comparison, galaxies today are not forming stars at nearly the same rate. In fact, star formation rates have dropped to about three percent of that peak [24]. Following such observations, the currently accepted model of galaxy evolution defines two regimes: star-forming and quiescent [12].

We have recently discovered a way to estimate the temperatures of galaxies using a technique called Initial Mass Function (IMF) fitting, described by Sneppen et al. [21]. Studying both the star formation rates and the IMF temperatures of galaxies, we find a small group of star-forming galaxies that have higher temperatures than other star-forming galaxies. Therefore, two distinct groups of star-forming galaxies are observed. This observation challenges our currently accepted model of galaxy evolution, which details just two regimes of galaxies: star-forming and quiescent. Thus, accounting for the newly-discovered group of star-forming galaxies, we propose an updated three-regime model of galaxy evolution.

In this paper, a machine learning technique called “dimensionality reduction” is used to provide experimental evidence of this new third regime of galaxies. In particular, we apply a dimensionality reduction algorithm called “t-SNE” [28], or “t-distributed stochastic neighbor embedding,” to the Cosmic Evolution Survey (COSMOS) dataset of galaxies which contains about one million galaxies and twenty-seven photometric bands for each galaxy [15]. Using dimensionality reduction, we are able to visualize this high-dimensional dataset and correct for numerous issues that arise when analyzing data points in a high-dimensional space. The most important of these issues is that data points are too far apart to be grouped into clusters. [20]. We describe and justify the methods we use to preprocess the COSMOS dataset, such as feature scaling and removing outliers. By preprocessing the dataset, we are able to produce meaningful plots that correctly cluster together physically similar galaxies. We show that the maps produced by t-SNE when applied to COSMOS provide clear visual evidence of our three-regime model of galaxy evolution.

This paper is structured as follows: Section 2 details the currently accepted two-regime model of galaxy evolution and our proposed three-regime model of galaxy evolution. Section 3 describes the COSMOS dataset of galaxies. In Section 4, we introduce and motivate the use of dimensionality reduction, as well as the t-SNE algorithm. In Section 5, we present the maps produced by t-SNE when applied to COSMOS. In Section 6, we summarize and discuss our results.

## 2. Models of Galaxy Evolution

### 2.1 Two-Regime Model

When we observe the galaxies in the Hubble Deep Field, a small region of space surveyed extensively by the Hubble Space Telescope, we see galaxies of various sizes, shapes, and ages [11]. However, in the last fifteen years, a puzzling fact about “star-forming galaxies” has been discovered: star formation rates seem to be consistent across galaxies with varying features and properties. These galaxies comprise the “Star-Forming Main Sequence” [22], in which all star-forming galaxies seem to form stars at the same rate. We also observe “quiescent galaxies,” which are galaxies that produce stars at a low rate. The currently accepted model of galaxy evolution classifies galaxies as either star-forming or quiescent [12]. There are two main reasons that astronomers accept this two-regime model.

First, multiple previous studies have shown that star-forming galaxies and quiescent galaxies can be clearly distinguished on a U-V vs. V-J (UVJ) diagram [14]. The U-V color band contains wavelengths from ultraviolet to visible light while the V-J color band contains wavelengths from visible to near-infrared light. Plotting galaxies of similar redshift and mass on a UVJ diagram that compares each galaxy’s fluxes in the U-V color band and the V-J color band, we see that star-forming galaxies and quiescent galaxies are clustered separately on either side of a distinct boundary (Figure 1).

Second, in today’s night sky, we observe red and blue galaxies, where quiescent galaxies appear red because they are dominated by red, low-mass stars, and star-forming galaxies appear blue because they are dominated by blue, high-mass stars [2]. We do not observe any galaxies that appear neither red nor blue. In this way, all star-forming galaxies appear to be identical—until we account for their IMF temperatures.

### 2.2 Three-Regime Model

Only the two regimes of “quiescent” and “star-forming” are apparent when studying the sSFRs, or specific star formation rates, of galaxies. This changes when we take into account the IMF temperatures of these galaxies, which are estimated with the IMF fitting method discovered by Sneppen et al. [21].

When we plot galaxies by log(sSFR) and IMF temperature, we see that galaxies with high values of sSFR, which are star-forming galaxies, span a wide range of temperatures. For example, in the hexbin plot of galaxies with a redshift value between 0.7 and 0.8, labeled a) in Figure 2, we see that there are galaxies that form a “horizontal branch” that have a different relationship between star formation rate and temperature than most of the other star-forming galaxies. While most star-forming galaxies have a similar temperature, the galaxies in this branch have a significantly higher temperature. This means that not all star-forming galaxies are physically similar, which contradicts the currently accepted theory.

Figure 2 shows six hexbin plots, each contain galaxies in redshift ranges of size 0.2, spanning from 0.5 to 1.7. Galaxies are binned by redshift because galaxies in different redshift ranges will appear different even if they are physically similar. This is because the wavelengths at which their fluxes are measured are significantly redshifted. The hexbin plot containing galaxies with redshift values between 0.5 and 0.7 is labeled, and all other plots should be interpreted identically.

We see that there are two distinct regimes of star-forming galaxies, which are labeled as Region 1 and Region 3. We label the galaxies in Region 1 (the “horizontal branch”) as the Pre-Main Sequence galaxies, because we believe that these galaxies are hot star-forming galaxies that are cooling into the Star-Forming Main Sequence, which we have labeled as Region 3. We have allocated Region 2 as a transition phase between the Pre-Main Sequence galaxies and the Star-Forming Main Sequence galaxies. Region 5 contains the quiescent galaxies and Region 4, similarly to Region 2, is a transition phase between the Star-Forming Main Sequence galaxies and the quiescent galaxies.

## 3. COSMOS Dataset

While observing the spectra of galaxies is likely the most accurate method of studying them, observing a high-resolution spectrum of a galaxy can be expensive and time-consuming. Thus, many astronomers use photometry to study galaxies, which measures the total amount of light emitted by a galaxy in certain “bands” of wavelengths [9]. While observing a high-resolution spectrum of a galaxy can be a long and expensive process, it is much quicker and cheaper to measure the amount of total electromagnetic radiation that the galaxy emits in wider “bands”. It is generally accepted that photometry is an accurate method of studying galaxies because it is unlikely that galaxies emit light in completely arbitrary combinations of wavelengths [17]. Thus, we can model a galaxy as if it is a nearby galaxy if we are able to observe that it emits light in certain “bands”. Previous studies of galaxies using photometry have been able to accurately deduce their properties, so we have evidence that photometry is a valid and efficient method of studying galaxies [17].

The COSMOS dataset of galaxies contains the “band” measurements required to conduct a photometric study. It catalogs about one million galaxies and twenty-seven photometric bands for each galaxy, making it one of the largest multi-wavelength catalogs of galaxies [15]. We see that COSMOS is a twenty-seven-dimensional dataset, where each photometric band is an “attribute” of the COSMOS dataset.

For our study, the flux measurements of the galaxies were retrieved from COSMOS2020, the most recent version of COSMOS published in 2021, while other physical properties of the galaxies were retrieved from COSMOS2015, the version of COSMOS published in 2016. The COSMOS2020 and COSMOS2015 datasets contain the same galaxies, each identified with a common unique identification number.

## 4. Dimensionality Reduction

Dimensionality reduction, an unsupervised machine learning technique, is the process of reducing the number of dimensions, or “attributes,” of a dataset [7]. Rather than keeping a select number of dimensions and discarding the excess ones, dimensionality reduction algorithms create entirely new dimensions that preserve information from each of the original dimensions. Dimensionality reduction algorithms have successfully been used in previous studies to classify galaxies [25]. There are two main reasons that dimensionality reduction is a useful tool for analyzing high-dimensional datasets.

The first reason can best be summarized as the “curse of dimensionality.” In a high-dimensional space, data points are very sparse, meaning that most data points are so sparse that they seem dissimilar [6]. Sparsity is measured using the Euclidean distance metric, which is the standard distance metric used to compute distances between points in a high-dimensional space. Because most data points are extremely sparse, clustering algorithms produce statistically insignificant results when applied to a high-dimensional dataset [1]. Furthermore, when distances are computed in a high-dimensional space, noise may be overemphasized since the noise may be present in many different dimensions [16]. High-dimensional spaces also have properties that are unintuitive to human analysts, since we are more familiar with two and three-dimensional spaces [13]. By applying dimensionality reduction to a high-dimensional dataset, we are able to overcome these issues by working with the data points in just two dimensions.

Another reason why we use dimensionality reduction is that it is hard, if not impossible, to visualize a high-dimensional dataset. In a one-dimensional graph, data points are plotted along a single line, making it very easy to visualize and compare them. In a two-dimensional graph, which is a very commonly used data visualization tool, each data point is ordered from left to right in ascending order in one attribute, and bottom to top in ascending order in the other attribute. As such, two-dimensional graphs are also very easy to visualize. A three-dimensional graph is harder to visualize than a one-dimensional or two-dimensional graph, although it is still feasible by creating an illusion of depth of perception. We may even try to use other visible attributes such as the colors or the sizes of the data points to represent values in four or five dimensions on a graph. However, there are not enough creative ways we can visualize a very high number of dimensions on a two-dimensional surface. There is clearly no feasible way to visualize a dataset like COSMOS which contains twenty-seven dimensions. Applying dimensionality reduction to these high-dimensional datasets allows us to visualize the data points on a standard two-dimensional graph.

### 4.1 t-SNE Algorithm

t-SNE, or t-distributed stochastic neighbor embedding, is one instance of a dimensionality reduction algorithm [28]. The t-SNE algorithm takes as inputs the high-dimensional dataset and the intended dimensionality of the output space, and it embeds the data points into a lower-dimensional space. In the process, t-SNE preserves the nearest neighbors of each data point but does not necessarily preserve exact distances and point densities. For most applications of t-SNE, a two-dimensional output space is selected for visualization purposes. The reduced dimensions created by t-SNE, or any other dimensionality reduction algorithm, are unitless and the numerical values of data points in these dimensions are insignificant.

t-SNE also has an input parameter called “perplexity” [8]. Perplexity can be thought of as the number of nearest neighbors that t-SNE factors in for each point when embedding the data points in a lower-dimensional output space. Choosing an optimal value of perplexity is accomplished through a heuristic approach. In particular, to choose a value for perplexity for a given iteration of t-SNE, a guess is made for the average number of relevant neighbors that each point in the dataset has. The value of perplexity can then be increased or decreased based on a qualitative observation of the map of the output space. Generally speaking, a lower value of perplexity means that t-SNE will preserve the local relationships between the data points, while a higher value of perplexity means that t-SNE will preserve the overall global structure of the data [27].

In order to justify why the t-SNE algorithm was used for this project, we first compare t-SNE to another commonly used dimensionality reduction algorithm: principal component analysis, or PCA. While PCA is a linear algorithm, t-SNE is a non-linear algorithm [10, 18]. PCA reduces the dimensions of a dataset into n dimensions by creating a basis of n orthonormal vectors of the original values and transforming each of the values into a linear combination of the n basis vectors. On the other hand, t-SNE uses a cost function called KL divergence to create a clustering of the data points in n dimensions by minimizing the cost between all of the points [26, 19].

KL divergence, or Kullback-Leibler divergence, calculates the relative entropy between two probability distributions, where relative entropy is the measure of how a probability distribution differs from another probability distribution [26, 19]. For two continuous probability distributions P and Q defined over an interval [a, b], the relative entropy between P and Q, denoted by $D_{KL}(P\,||\,Q)$, is defined as:

$D_{KL}(P\,||\,Q) = \int_a^b p(x)\,\log{\left(\frac{p(x)}{q(x)}\right)}dx$

where p and q are the probability density functions of P and Q respectively.

### 4.2 Applying t-SNE to MNIST

We briefly study the application of t-SNE to the popular MNIST dataset to illustrate how t-SNE works. MNIST is a dataset of 42,000 images of digits from 0 to 9, where each image is 28 pixels across and 28 pixels down [3]. This means that each image contains $28^2 = 784$ total pixels. Each pixel can be considered an attribute, or “dimension”, of the dataset, which makes MNIST a 784-dimensional dataset. For the previously detailed reasons, we apply the t-SNE dimensionality reduction algorithm to cluster and map the images in a two-dimensional output space.

Non-linear dimensionality reduction algorithms like t-SNE can cluster the images in MNIST more effectively than linear dimensionality reduction algorithms such as PCA. This is because an image of a digit composed of a grid of pixels cannot be logically represented by PCA as a linear combination of basis vectors, unlike other objects such as words or numbers. Thus, PCA is unable to accurately cluster images of the same digit together, while t-SNE is able to do so.

As Figure 3 shows, for the MNIST dataset, focusing on the local structure of the clusters by choosing a low perplexity value yields the most accurate and well-defined clusters. This is an unexpected result because MNIST is quite a large dataset, and focusing on global structure tends to work more effectively for large datasets. However, we can reason that a low perplexity value worked well for MNIST because each image only has a few other images that look similar to it.

As seen in Figure 4, t-SNE has accurately clustered together images of the same digit in a two-dimensional space. The way in which t-SNE arranges the distinct clusters of digits is also significant. Consider the green cluster of images of the digit ‘3’ in the middle of the map. An image of a ‘3’ on the left side of the cluster, nearer to the orange cluster of images of the digit ‘1’, is more vertical—resembling a ‘1’. On the other hand, an image of a ‘3’ on the right side of the cluster, nearer to the purple cluster of images of the digit ‘8’, is more round and more closely resembles an ‘8’. Even across clusters, t-SNE places data points near other similar points, which is useful in other real-world datasets such as COSMOS that lack well-defined clusters.

## 5. Applying t-SNE to COSMOS

For the purposes of our study, we worked with a subset of the COSMOS dataset containing only the galaxies that have an IMF temperature assigned by the IMF fitting method. This was done because we need the IMF temperature of a galaxy to label it with one of the regimes of our three-regime model of galaxy evolution. This subset of the COSMOS dataset that we used in our study contains 97,217 distinct galaxies, which is about 9.72% of the entire COSMOS dataset.

Preprocessing a dataset before applying an algorithm such as t-SNE facilitates more accurate and efficient data analysis [23]. In our analysis of COSMOS, we applied two preprocessing techniques: removing erroneous data and feature scaling (normalization).

The COSMOS dataset contains erroneous data in the form of nonmeasurements and nondetections, where some galaxies in the dataset have erroneous or no measurements in some or all of the photometric bands. In COSMOS, the value -99 denotes a nonmeasurement or nondetection for a galaxy in a certain photometric band. By removing the roughly 8,000 galaxies in our subset with flux values of -99 in any photometric band, we were able to create a more accurate output map.

Normalizing the COSMOS dataset was a critically important step because the galaxies are at different distances, which means flux measurements alone cannot be used to compare the luminosities of galaxies at different distances. Consider this example: Galaxy X has flux measurements of 1 in Photometric Band A and 1 in Photometric Band B, Galaxy Y has flux measurements of 1 in Band A and 2 in Band B, and Galaxy Z has flux measurements of 4 in Band A and 4 in Band B. However, Galaxy X and Y are at the same distance away while Galaxy Z is at half the distance of the other two galaxies. According to the inverse square law, Galaxy X and Galaxy Z actually have the same luminosity in both bands. However, our dimensionality reduction algorithms would consider Galaxy X and Galaxy Y to be more similar than Galaxy X and Galaxy Z because the Euclidean distance between them in these two measurements is smaller (1 between Galaxy X and Galaxy Y; $3\sqrt{2}$ between Galaxy X and Galaxy Z). Thus, to fix this bias, we normalize all of the measurements to one band. That is, all of the flux measurements are scaled such that every galaxy has the same flux value in one of the bands, and the measurements in the other bands are scaled accordingly. In this example, Galaxy X would be scaled to have flux measurements of 1 in Band A and 1 in Band B, Galaxy Y would be scaled to have flux measurements of 1 in Band A and 2 in Band B, and Galaxy Z would be scaled to have flux measurements of 1 in Band A and 1 in Band B.

After then applying t-SNE to the preprocessed COSMOS dataset, we found that the primary attribute that influenced the clustering of points was redshift. This was because two similar objects would look completely different at different redshifts because the flux measurements at one set of wavelengths for one galaxy would all be redshifted to a new set of wavelengths for the other galaxy. Due to this, we created ranges of redshifts and created t-SNE plots of galaxies in each of these bands. The redshift ranges range from 0.5 to 1.5 and are of size 0.1, meaning that the first redshift range contains galaxies from redshift 0.5 to 0.6, the second redshift range contains galaxies from redshift 0.6 to 0.7, and so on. Each galaxy in one of these ranges has values of 1+z, which is 1 added to the redshift value, within 10% of each other.

## 6. Results of t-SNE on COSMOS

After experimenting with different values of perplexity, we used a perplexity value of 40 to generate our final two-dimensional t-SNE maps. This is a moderately high value of perplexity.

Figure 5 shows ten two-dimensional t-SNE maps, with each containing the galaxies in COSMOS in the labeled redshift range of size 0.1. In each of these maps, it is apparent that there is a gradual transition from galaxies in Region 1, which are the Pre-Main Sequence galaxies, to galaxies in Region 3, which are the Main Sequence Galaxies, and then to galaxies in Region 5, which are the quiescent galaxies. Galaxies in Regions 2 and 4, which were the transitions, also seem to be properly located on the maps so that the galaxies are arranged roughly sequentially from Region 1 up to Region 5.

We see that in each of the ten t-SNE maps, there appears to be a small cluster of galaxies that are separated from the main, large cluster of galaxies. By coloring the t-SNE maps using various physical attributes of the galaxies, we have identified the galaxies in these small clusters to be galaxies with high amounts of dust emission. Even within these small clusters, we see that galaxies are arranged roughly sequentially by region.

To more clearly illustrate this sequential ordering, we also plotted hexbin plots for the same ten redshift ranges. Each bin is colored by the modal regime, which is the regime that is most frequently represented in that bin. In other words, the mode was the function used to reduce all of the values in a bin to a single value. The ten hexbin plots are displayed in Figure 6, which clearly display the sequential ordering of the regimes.

## 7. Conclusion and Outlook

The two-dimensional t-SNE maps and the hexbin plots cluster together galaxies of the same regime, meaning that pre-main sequence galaxies are most similar to other pre-main sequence galaxies, main sequence galaxies are most similar to other main sequence galaxies, and quiescent galaxies are most similar to other quiescent galaxies. We know this based on our observation that t-SNE tends to cluster together data points that are similar to each other.

We observe that the clusters of regimes are arranged in sequential order, from pre-main sequence galaxies, to the main sequence galaxies, to the quiescent galaxies, with the two clusters of “transition” galaxies also being placed correctly between the three main clusters. As demonstrated with the application of t-SNE to MNIST, we know that t-SNE places data points near other similar points even across clusters. Using this knowledge, we see that t-SNE is indicating that galaxies in one region of our maps are more similar to galaxies in directly preceding or succeeding regions than to galaxies in other regions.

We conclude that the t-SNE maps illustrate and provide evidence for our three-regime model of galaxy evolution, where a hot, star-forming galaxy cools into the Star-Forming Main Sequence, which then transitions into a quiescent galaxy when its blue, high-mass stars die out and star formation ceases.

It is clear that dimensionality reduction is a useful tool that can be used to uncover realities in datasets that are not discernible through mere observation. Specifically, in the field of astronomy, applying t-SNE or other dimensionality reduction algorithms to datasets of other celestial objects such as stars and galaxy clusters may allow us to discover new groups and classifications.

## Acknowledgements

First, I would like to thank my mentor, Professor Charles L. Steinhardt, for his guidance throughout this research project. I would also like to thank Thomas H. Clark, Andrei C. Diaconu, and the faculty at the Cosmic Dawn Center for their support. This project would not have been possible without the generous funding contributed by Mr. and Mrs. Hassenzahl.

## References

1. Armstrong D. P., Mccarthy M. A., 2007, Avian Conservation and Ecology, 2
2. Astrobites 2017, Bulges Are Red, Disks Are Blue…, https://aasnova.org/2017/01/31/bulges-are-red-disks-are-blue/
3. Beohar D., Rasool A., 2021, 2021 International Conference on Emerging Smart Computing and Informatics (ESCI)
4. Brammer G. B., et al., 2011, The Astrophysical Journal, 739, 24
5. Bridge J., 2016, The changing star formation rate of the universe, https://astrobites.org/2016/11/11/the-changing-star-formation-rate-of-the-universe/
6. Cevher V., Hegde C., Duarte M. F., Baraniuk R. G., 2009, 10.21236/ada520187
7. Chetana V. L., Kolisetty S. S., Amogh K., 2020, Recent Advances in Computer Based Systems, Processes and Applications, p. 3–14
8. Devassy B. M., George S., Nussbaum P., 2020, Journal of Imaging, 6, 29
9. Fernandez R. L., et al., 2016, Monthly Notices of the Royal Astronomical Society, 458, 184–199
10. Forsyth D., 2019, Applied Machine Learning, p. 93–115
11. Ghosh B., Durrer R., Sch ̈afer B. M., 2021, Monthly Notices of the Royal Astronomical Society, 505, 2594–2609
12. Gomez-Guijarro C., et al., 2019, The Astrophysical Journal, 886, 88
13. Khoury M., 2018, https://marckhoury.github.io/blog/
counterintuitive-properties-of-high-dimensional-space
14. Labbe I., et al., 2005, The Astrophysical Journal, 624
15. Laigle C., et al., 2016, The Astrophysical Journal Supplement Series, 224, 24
16. Litvinenko A., Matthies H. G., 2009, Pamm, 9, 587–588
17. Masters D., et al., 2015, The Astrophysical Journal, 813, 53
18. Mcinnes L., Healy J., Saul N., Großberger L., 2018, Journal of Open Source Software, 3, 861
19. Nishiyama T., Sason I., 2020, Entropy, 22, 563
20. Peng H., Pavlidis N., Eckley I., Tsalamanis I., 2018, 2018 IEEE International Conference on Big Data (Big Data)
21. Sneppen A. B., Steinhardt C. L., 2020
22. Speagle J. S., Steinhardt C. L., Capak P. L., Silverman J. D., 2014, The Astrophysical Journal Supplement Series, 214, 15
23. Srivastava M., Srivastava A. K., Garg R., 2019, SSRN Electronic Journal
24. Staff S., 2012, Star Formation Sputtering Out Across the Universe, https://www.space.com/18370-universe-star-formation-rate-decline.html
25. Steinhardt C. L., Weaver J. R., Maxfield J., Davidzon I., Faisst A. L., Masters D., Schemel M., Toft S., 2020, The Astrophysical Journal, 891, 136
26. Wang H.-L., 2008, Acta Automatica Sinica, 34, 529–534
27. Wattenberg M., Viegas F., Johnson I., 2016, How to Use t-SNE Effectively, https://distill.pub/2016/misread-tsne/
28. van der Maaten L., Hinton G., 2008, Journal of Machine Learning Research, 9