Identifying Optimal Proteins by Their Structure Using Graph Neural Networks

Proteins are the molecules of life, essential for a wide range of functions from protecting the body against viruses to building the structures of all living things. Protein engineering repurposes these molecular machines, optimizing proteins for new, valuable functions in technological, scientific, and medical applications. New approaches to protein engineering leverage machine learning to find the best proteins for a given application with limited experimental characterization. A machine learning model can be used to predict how well a protein performs a given task, in other words, its “fitness.” The way we computationally represent proteins is crucial in determining how these models predict protein properties. While most approaches represent proteins by their sequence, we hypothesize that protein structure provides richer information for the model to learn these properties. We represent each protein of interest as a graph, a network of amino-acid connections in the protein, and implement a graph machine learning model to predict a protein’s fitness. We show that this structure-based model outperforms sequence-only approaches for fitness prediction. We further extend this model to automatically find the best protein for a given task by optimizing a protein’s graph representation. We demonstrate that this novel approach is the most successful at finding the best protein from a small number of proposed candidates.

Author: Marianne Arriola
University of California, Santa Barbara
Mentors: Frances Arnold and Kadina Johnston
California Institute of Technology
Editor: Wesley Huang

Using Machine Learning to Explore the Non-natural World

Protein engineering has been used to uncover and optimize protein sequences that are not found in the natural world but have valuable functions for technological, scientific, and medical applications [1]. Protein engineers hope to find proteins with maximal “fitness,” a measure of how well a protein performs a desired function – for example, proteins that are strongest at enacting responses for disease treatment or proteins that are the most stable in different environments. Machine learning has accelerated the lab-based search for optimal proteins by computationally predicting protein fitness [1].

To predict fitness, one represents the protein in a data structure a model can understand so that the model extracts as much useful information about the protein as possible. Informative protein representations are essential; they determine how the model learns from a protein’s characteristics to predict its fitness. The most common approaches represent proteins as sequences of amino acids, the basic building blocks of proteins [1]. However, protein structure is crucial for function and may therefore better represent proteins by encoding how a protein twists, folds, and bends to carry out its function. Recent work demonstrates that graphs, a data structure used to describe relationships between entities, efficiently represent protein structure [3]. A protein can be encoded as a graph through nodes (amino acids) and connecting edges (distances between amino acids) (Figure 1). Previous work has found that these representations provide higher accuracy in predicting structure-function relationships than sequence-based representations [3]. Though graph representations have been used to classify protein function, to the best of our knowledge they have not yet been used to predict protein fitness. Therefore, we propose graph-based approaches to represent and evaluate proteins.

Figure 1: Graph-based representation of a protein structure.
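As a rough illustration of this idea, a graph's edges can be derived from a contact map: connect two amino acids whenever their 3D coordinates fall within a distance cutoff. The sketch below is not the exact pipeline used in this work; the 8 Å cutoff and the toy coordinates are illustrative assumptions.

```python
import numpy as np

def contact_graph(coords, cutoff=8.0):
    """Adjacency matrix from per-residue 3D coordinates: two
    residues are connected if they lie within `cutoff` angstroms
    (a common, but here assumed, contact heuristic)."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacement vectors
    dist = np.sqrt((diff ** 2).sum(axis=-1))         # pairwise distances
    adj = (dist < cutoff).astype(float)
    np.fill_diagonal(adj, 0.0)                       # no self-edges
    return adj

# Toy chain of four residues spaced 5 angstroms apart
adj = contact_graph([[0, 0, 0], [5, 0, 0], [10, 0, 0], [15, 0, 0]])
```

With these spacings, only adjacent residues (5 Å apart) are connected; residues two positions apart (10 Å) are not.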

Research objectives

We test whether structure provides more information than sequence for fitness prediction

We hypothesize that a structure-based approach will achieve higher predictive accuracy than sequence-based approaches because structure directly impacts protein function and therefore may provide more useful information for predicting fitness. We implemented a graph convolutional network (GCN), a machine learning model that learns from graph-structured data, to predict protein fitness from structural information, which would enable us to identify high-functioning proteins. We then measured the advantage of a structure-based prediction model over a sequence-based one by comparing their predictive performances. Moreover, this approach can validate, or potentially improve upon, computational fitness prediction in the Arnold Lab for optimal protein identification.

We explore whether fitness prediction models can predict activity on different substrates

Protein fitness depends on the identity of the substrate, the molecule a protein acts on. We propose that activity on one substrate cannot reliably predict activity on other substrates, because a machine-learning model is tuned and trained on the patterns in the data it is given (in this case, activity on a particular substrate). The model can lose predictive accuracy when given data with different trends (activity on alternate substrates).

Substrate structure is also crucial to protein activity: protein activity can vary widely in response to subtle changes in substrate structure. When nature has evolved a protein for a specific substrate, that activity is referred to as natural activity. Activity on substrates with fundamentally different structures is termed non-natural activity, and, importantly, it is activity that nature has not evolved for and must be engineered. We compared model performance at predicting non-natural protein activity versus natural activity. A model that can accurately predict non-natural activity is valuable because protein engineers typically pursue activities that nature has not yet optimized.

We attempt to automate optimal protein identification

We define a novel method to efficiently search through spaces of large numbers of variants using a protein’s graph representation. We believe that this approach could enable automated discovery of optimal proteins for a given function without having to explore every possible variant, which is computationally expensive. The model should deduce which structural patterns result in higher fitness to find the optimal protein.


How do we test our models?

We tested our approaches by finding the optimal protein among all ~6,000 single variants (variations of a protein with a mutation at a single amino-acid position in the sequence) of amiE, a gene that encodes an enzyme from P. aeruginosa [4]. Wrenbeck et al. quantified the fitnesses of these variants (cell growth, in this case) when performing activity on three substrates: acetamide (ACT), propionamide (PR), and isobutyramide (IB). Activity on ACT and PR is considered natural, as amiE has evolved for substrates with these structures, whereas activity on IB, which has a fundamentally different structure, is comparatively non-natural [4].

How can we encode protein structure into a graph?

We used Triad, a physics-based protein modeling software, to predict protein structure from sequence. We then used the predicted structure to generate a graph representation describing the amino-acid contacts in the protein. We applied a language model, a model architecture that typically learns from text (in this case, protein sequence) [3], to extract amino acid-level features that describe the nodes of the graph representation. Figure 2 illustrates the construction of the protein’s graph representation, where structure and sequence information are used in tandem.

Figure 2: Illustration of protein graph representation construction.

Can structure-based representations improve fitness predictions?

We trained a GCN to learn and subsequently predict protein fitness from its graph representation, as illustrated in Figure 3. The model takes as input a graph representing a protein and passes it through a series of convolutions. These convolutions uncover potentially valuable hidden information about the protein by enriching the graph representation via message passing, a framework in which node information is passed to connecting nodes at each step. In this context, sequence information about each amino acid is used to update the information describing neighboring amino acids. The GCN produces a new, higher-level graph representation in which each node contains information not only about itself but also about its neighbors. This is followed by a pooling layer and an activation function that reduce the dimensionality of the graph to a single fitness score. The model ‘learns’ to output an accurate fitness score by learning which node connections to prioritize in the convolutions to produce more accurate predictions.

Figure 3: Graph convolutional network that uses a protein’s graph representation to predict its fitness. The input graph structure undergoes a series of convolutions that results in a new, complex graph representation. This representation undergoes pooling and activations that reduce its dimensionality to a single fitness score.
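To make the forward pass concrete, here is a minimal, hypothetical version of such a model in plain NumPy: one normalized message-passing layer with a ReLU activation, mean pooling over nodes, and a linear readout to a scalar fitness score. The layer sizes and random weights are illustrative only; the trained GCN used in this work is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_fitness(adj, feats, w1, w_out):
    """One message-passing layer (neighbour averaging + ReLU),
    mean pooling over nodes, and a linear readout to a scalar."""
    a_hat = adj + np.eye(len(adj))                   # add self-loops
    norm = a_hat / a_hat.sum(axis=1, keepdims=True)  # average over neighbours
    h = np.maximum(norm @ feats @ w1, 0.0)           # message passing + ReLU
    pooled = h.mean(axis=0)                          # pool node features to a graph vector
    return float(pooled @ w_out)                     # scalar fitness score

# Toy 4-residue chain graph with 8-dimensional node features
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
feats = rng.normal(size=(4, 8))                      # stand-in for per-residue features
w1 = rng.normal(size=(8, 16))                        # untrained, random weights
w_out = rng.normal(size=16)
score = gcn_fitness(adj, feats, w1, w_out)
```

In training, the weights `w1` and `w_out` would be fit to measured fitness values rather than drawn at random.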

How can one efficiently navigate a protein fitness landscape using structure?

We defined a method to efficiently find the optimal protein without predicting the fitnesses of all candidates, which is computationally expensive. Instead, we predict the fitness of only a small subset of the variants: we optimize the graph structure to retrieve the variant with the highest predicted fitness. We repeatedly modify a protein’s graph representation, moving through the “landscape” of all potential protein variants toward higher predicted fitness, until the protein with the best fitness is found (Figure 4) [5].

Figure 4: Fitness landscape navigation workflow. A graph representing a protein is modified to maximize predicted fitness, subject to the constraint that the graph represents a physically plausible protein. The constraint is maintained by ensuring the graph is drawn from those of known variants.
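Restricting moves to known variants turns the search into a discrete hill climb. The following sketch assumes a hypothetical `predict_fitness` function (standing in for a trained model) and a `neighbours` function returning nearby known variants; it is a simplification of the graph-optimization procedure described above.

```python
def hill_climb(predict_fitness, neighbours, start, rounds=50):
    """Greedy navigation: move to the best-scoring neighbouring
    known variant until no neighbour improves (a peak, possibly
    only a local one)."""
    current = start
    for _ in range(rounds):
        best = max([current] + neighbours(current), key=predict_fitness)
        if best == current:      # no improving neighbour: stop at this peak
            return current
        current = best
    return current

# Toy 1-D "landscape" of 10 variants with a single peak at position 7
predict_fitness = lambda i: -abs(i - 7)
neighbours = lambda i: [j for j in (i - 1, i + 1) if 0 <= j <= 9]
peak = hill_climb(predict_fitness, neighbours, start=0)  # → 7
```

On this single-peaked toy landscape the climb always reaches the optimum; the rugged-landscape failure mode is discussed later in the article.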


Structure-informed representations improve fitness predictions

Comparing the performance of structure-based and sequence-based fitness prediction models supported our hypothesis. Performance here is measured by the correlation between true and predicted fitnesses: high correlation (above 0.5) means that the model can reliably distinguish high-functioning from low-functioning variants. Representations that utilize structure yield higher performance than those that rely solely on sequence (Figure 5). This demonstrates that structure-informed protein fitness prediction models are more accurate than the baseline sequence-based approach, supporting the conclusion that protein structure is more informative than sequence alone.

Figure 5: Comparison of structure-based and sequence-based representations for fitness prediction. The x-axis corresponds to the substrate used for fitness prediction and the y-axis corresponds to the correlation between the true and predicted fitness. The bar color indicates whether a sequence-based or structure-based model was used.
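For readers unfamiliar with the metric, the correlation plotted in Figure 5 can be computed directly from paired true/predicted fitness values; the numbers below are invented for illustration.

```python
import numpy as np

# Invented true and predicted fitness values for five variants
true_fitness = np.array([0.10, 0.40, 0.35, 0.80, 0.95])
pred_fitness = np.array([0.20, 0.50, 0.30, 0.70, 0.90])

# Pearson correlation; values above ~0.5 indicate the model
# separates high- from low-functioning variants reasonably well
r = np.corrcoef(true_fitness, pred_fitness)[0, 1]
```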

Substrate identity results in independent fitness predictions

We show that enzyme activity is substrate-specific: the fitness prediction model performs poorly at predicting activity on substrates other than the one used for model training. As Figure 6 shows, the model that performs best at predicting activity on a given substrate is the one trained on activity on that substrate. Since fitness predictions on one substrate do not transfer to another, enzyme activity must be substrate-specific.

Furthermore, among models trained and evaluated on the same substrate, our approach is best at predicting non-natural activity on IB rather than natural activity on ACT and PR. Thus, our results imply that our model can best predict non-natural activity. We suggest that this is because IB fitness values span a wider range across variants than those for ACT and PR; a wider range of fitness values lets the model learn more structure-fitness patterns, whereas a narrow range limits what the model can learn.

Figure 6: Models were trained on fitness data from one of the three substrates. Trained models were then used to predict fitness values for data from each of the three substrates. The x-axis corresponds to the substrate used for fitness prediction and the y-axis corresponds to the correlation between the predicted and true fitness. The bar color indicates what fitness data the model was trained on.

Graph optimization can identify optimal proteins in small fitness landscapes

We designed a method for finding proteins of optimal fitness by repeatedly updating and optimizing the graph representation of the protein. We found that in a small landscape of 30 variants, the model could successfully identify the variant with the optimal fitness. Figure 7 shows the updated representation’s similarity to the optimal variant throughout training: ideally, the model would produce a variant identical to the optimal one, with similarity increasing over training rounds. As shown in Figure 7 (left), the model navigates to the optimal protein at around the 35th training round, after which the similarity oscillates between the variant the model deems to have ideal fitness and another variant in the landscape. This pattern signifies that the model has already found a protein it deems optimal and can no longer navigate the landscape to find new variants.

Figure 7: Similarity of model-produced graph to optimal graph across optimization rounds. Left: model performance when navigating through 30 variants. Right: model performance when navigating through 600 variants.

However, in a larger landscape of 600 variants, the model finds a protein with higher fitness, but not the variant with 100% similarity to the optimal, as shown in Figure 7 (right). This is likely because the model finds a protein with maximum fitness relative to proteins nearby in sequence space rather than relative to the entire landscape. Figure 8 shows the distribution of fitnesses across sequence space, where there are multiple local maxima in fitness scattered throughout the landscape that the model can incorrectly identify as the optimal protein.

Figure 8: Distribution of average fitnesses of single-mutant amiE variants on the propionamide substrate (used as an example case) across sequence space. The x-axis corresponds to a variant’s mutation position in the sequence; the y-axis corresponds to the average fitness. Variants with higher fitness have bars closer to the top of the graph (a fitness of 0 corresponds to wild-type activity). Only variants with mutations after the 50th amino acid are included for clarity.


Structure-based fitness prediction improves over a sequence-based approach, making it a promising, more accurate tool for finding optimal proteins for a given function. However, there are limitations: computing the structure for every variant is far more expensive than simply using sequence, and this cost grows with the number of variants. The user must therefore weigh the number of variants to be evaluated, as a graph-based approach may be less practical for very large variant sets.

We found that the structure-based model performs considerably better at predicting non-natural activity than natural activity. This implies that the model can be utilized in predicting the fitnesses of potentially valuable non-natural proteins that can be created via protein engineering.

This work has only been tested on a single-mutant fitness landscape, where there are only small changes in structure. However, it may be of interest to search for the best protein through variants with multiple mutations, where there are larger changes in structure. If there are larger changes in structure, the model may have more success at differentiating between structures in producing fitness predictions. Unfortunately, landscapes with multiple simultaneous mutations are largely populated by variants that exhibit little to no activity (extremely low fitness), making navigation a more difficult task.

Fitness landscape navigation is of interest because it would allow one to automatically find optimal proteins for a given function without lab experimentation and without calculating every candidate’s fitness. This is a novel approach in machine learning for fitness prediction, as most related approaches focus only on predicting a given protein’s fitness. However, navigating diverse landscapes with different topologies may pose challenges. Our landscape navigation method incrementally finds proteins with higher fitness until it reaches a peak, which it assumes is the optimal protein (in other words, the highest peak in the landscape), as shown in Figure 9. There is no way of guaranteeing that this peak is the highest one (the true optimal protein) without exploring every point in the landscape. The model therefore succeeds in finding the optimal protein in a landscape with only one peak (Figure 9, left). However, if there are multiple peaks, the model may find a peak that is not the highest and incorrectly assume that it is the optimal protein (Figure 9, right).

Our fitness landscape navigation model can successfully find the optimal protein in a small landscape, but it was unsuccessful in a larger landscape likely because it is more complex and has multiple peaks. Future work to make this navigation robust to diverse landscapes will give researchers freedom to find any optimal protein, no matter how smooth or rugged the landscape is.

Figure 9: Visualization of fitness landscape topologies, where the horizontal axes correspond to sequence space and the vertical axis corresponds to fitness. Fitness landscapes can have either a smooth topology (left), where there is a single peak that is the optimal protein, or a rugged topology (right), where there are multiple local peaks across sequence space that are not necessarily the optimal protein.


In summary, we first found that protein structure is more informative than sequence for protein fitness prediction. We also found that non-natural activity is more predictable than natural activity. Third, we showed that a fitness prediction model trained on a specific substrate performs best at predicting activity on that substrate. Finally, the fitness landscape navigation model proposed here to automatically find the best protein for a given task succeeds in smaller landscapes but struggles in larger, more complex landscapes.

It is worth exploring this approach further to enable successful navigation of larger, diverse fitness landscapes: for example, landscapes of variants that have multiple mutations. Navigating through variants with multiple simultaneous mutations is likely to result in barren landscapes with few peaks; it may be of interest to define a method to navigate through such landscapes. A potential approach would begin several navigation searches at different points in sequence space, with the hope that the model will find different peaks.
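The multi-start idea sketched in this paragraph can be written down directly: run the greedy navigation from several starting variants and keep the best peak found. All names and the toy landscape below are hypothetical, with `predict_fitness` standing in for a trained model.

```python
def multi_start_search(starts, predict_fitness, neighbours, rounds=50):
    """Run a greedy climb from several starting variants and keep
    the best peak found, so a single rugged region cannot trap
    every search."""
    def climb(current):
        for _ in range(rounds):
            best = max([current] + neighbours(current), key=predict_fitness)
            if best == current:          # reached a peak
                return current
            current = best
        return current
    return max((climb(s) for s in starts), key=predict_fitness)

# Toy rugged landscape: local peak at variant 2, global peak at variant 8
predict_fitness = {0: 0, 1: 1, 2: 3, 3: 1, 4: 0,
                   5: 2, 6: 5, 7: 7, 8: 9, 9: 4}.get
neighbours = lambda i: [j for j in (i - 1, i + 1) if 0 <= j <= 9]
best = multi_start_search([0, 9], predict_fitness, neighbours)  # → 8
```

Starting only from variant 0 would get stuck on the local peak at variant 2; adding a second start lets the search reach the global peak.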

This work is a promising step in informed protein engineering, which will enable better protein identification needed for impactful technological, scientific, and medical applications.

Further Reading

Exploring protein fitness landscapes by directed evolution

Machine-learning guided directed evolution for protein engineering

Structure-based protein function prediction using graph convolutional networks


  1. Yang, K. K., Wu, Z. & Arnold, F. H. (2019) Machine-learning-guided directed evolution for protein engineering. Nat Methods 16, 687–694. DOI: 10.1038/s41592-019-0496-6
  2. Romero, P. A. & Arnold, F. H. (2009) Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol 10, 866–876. DOI: 10.1038/nrm2805
  3. Gligorijević, V., Renfrew, P. D., Kosciolek, T. et al. (2021) Structure-based protein function prediction using graph convolutional networks. Nat Commun 12, 3168. DOI: 10.1038/s41467-021-23303-9
  4. Wrenbeck, E., Azouz, L. & Whitehead, T. (2017) Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nat Commun 8, 15695. DOI: 10.1038/ncomms15695
  5. Li, J., Zhang, T., Tian, H., Jin, S., Fardad, M. & Zafarani, R. (2020) SGCN: A Graph Sparsifier Based on Graph Convolutional Networks. Advances in Knowledge Discovery and Data Mining 12084, 275.


I would like to thank my co-mentor, Kadina, for being supportive of my ideas, patient with my work, and for giving me valuable insights on my project; I would also like to thank my mentor, Dr. Frances Arnold, for giving me the opportunity to work outside of my discipline in chemical engineering research, and Dr. Sabine Brinkmann-Chen for taking the time to give helpful feedback on my reports and presentations. I thank my principal investigator from the DYNAMO Lab at UCSB, Dr. Ambuj Singh, for inspiring me to pursue graph machine learning. Finally, I send my thanks to the Resnick Sustainability Institute (RSI) and Student-Faculty Programs (SFP) at Caltech for providing me with financial support on my project.

Interview with Professor Joseph Falson

Interviewer: Audrey DeVault


Dr. Joseph Falson joined Caltech’s faculty in 2019 as a visiting associate with the Department of Applied Physics and Materials Science, and became an Assistant Professor of Materials Science in 2020. His research focuses on the synthesis and characterization of quantum materials that display emergent functionalities. Dr. Falson received his B.S. from the University of New South Wales (2009), M.S. from Tohoku University (2012), and Ph.D. from the University of Tokyo (2015).


  • Favorite food: Very good sushi; I did my PhD in Japan and was exposed to very good sushi by my seniors. I go very seldom, but I appreciate it.
  • Favorite country: Japan. It is one of the most beautiful countries on earth. It’s got everything, it’s really quite an amazing place.
  • Favorite musical genre: I enjoy all types of music, but I have recently been enjoying LA’s KJazz 88.1 radio station. It’s quite famous, and you can stream it from around the world; it’s fantastic!
  • Favorite theory/effect: The Fractional Quantum Hall Effect. It is a very beautiful, complex, layered, and emergent property of condensed matter systems. It deals with the concept of a system as a whole being more than the sum of its parts. The effect only emerges in many-body systems and is very difficult to predict a priori. It is nice to find properties of materials that are difficult to predict theoretically (and be ahead of the theorists). It’s very curiosity driven.
  • A fun fact most people don’t know: Most people don’t know that I’m Australian. I haven’t lived in Australia in about 10 years now, so it’s very rare that people pick up on my accent. Of course, Australians would hear me and know, but otherwise it’s difficult for people to pinpoint.

What type of research does your lab focus on?

Probing the fundamental properties of nature. Specifically, we look at crystalline materials. We grow crystals in “thin film” form, on the scale of 10 or 100 nm thick. When crystals are made very thin, their properties begin to change due to dimensionality effects. In quantum mechanics you may have seen the example of a quantum well, where the energy levels in some spatial direction become quantized. So now you have a combination of very subtle material properties that are built into the crystal structure, and you can perturb it by making the crystals into the shape that you want.

The reason why we try to make really clean crystals is because if you have crystals that are highly defective, and electrons inside them, the electrons will begin to feel the effects of the lattice much more than the effects of the other electrons. So in some situations, the lattice becomes like a secondary property, and we can begin to see the subtleties of how the electrons talk to each other while dressed in some property of the lattice. Usually these properties are very delicate, and only appear in the cleanest materials, often at very low temperatures (about 1 K).

What motivates your research?

We are motivated by the search for new paradigms. Scientists are often asked about the real-world applications of their research, but for us, it isn’t about creating a new material for use in phones or cameras. We are focused on the exploration of fundamental properties of nature.

That’s not to say our research won’t have amazing applications in the future; everything around us started out as fundamental research. Semiconductor science used to be considered uncontrolled and unimportant in comparison to other more well-understood, controllable fields like chemistry and biology. But now I am looking at a billion semiconductors on my screen. Lasers too; when the laser was invented, it wasn’t clear what its applications would be. But now our internet service is based on fiber optic cables across the seas. So just because an application is opaque at the outset doesn’t mean that it’s going to remain that way going forward.

There aren’t that many places where you can pursue pure, fundamental science like this, but Caltech is one of the few places that you can.

How did you get into Materials Science?

100% by chance, I have had a very long and winding road to get here. I was dedicated to doing medicine when I entered university but found very quickly that it wasn’t for me. I graduated with an undergraduate degree in chemistry, but at that point I was already sure that I didn’t want to do traditional wet chemistry. But I didn’t have a great idea of what I wanted to do, I just knew I was interested in electronics and crystal growth.

In my 3rd and 4th years of undergrad I explored outside of my chemistry coursework, took some physics and materials science courses, and found that I enjoyed semiconductor physics. When I began applying to grad school, I applied for a Japanese government scholarship that I had been introduced to in a humanities course, and won it. So I cold emailed some professors in Japan. I met with a few chemistry professors, and one directed me towards another potential mentor he thought I would click with, who studied crystal growth and materials physics.

It was like falling in love in a scientific sense.

I had a very good connection with the mentor he directed me to. So I applied, got the scholarship, got into the school, and then went there for 5 years and got my PhD. I had an absolutely fantastic time. A lot of stars aligned. From there I made another international connection that led to a postdoc position in Germany for four years, and then applied to Caltech, and got the job. I was just driven by my gut feeling.

What I try to tell my students is that in life, the key to success is finding something you are able to do for 10 hours a day and not grow to hate it. If you want to be top-class successful at anything in life, it doesn’t matter how smart you are if you don’t put the time in. Especially in experimental science, so many things go wrong, and you encounter such difficult procedures, so you have to just be tough. I never get up and go to work and want to go home, and I hope that the people around me find that as well.

What have been some of the challenges and rewards of building a new lab? (How has the pandemic affected things?)

Everything has been challenging because it’s been very difficult to get extra manpower, and everything gets delayed. A lot of the equipment is now set up, but I had to do that 100% with my one student. For the vacuum equipment, usually a team would have flown in for two weeks to set it up, so it was quite a monumental task, and it’s taken us about 3 months of work. The good thing that’s come out of that is that now I know that machine like it’s the back of my hand. That’s my baby now, I know all the intricate details. So as hard as it’s been, I’m pretty happy.

What do you think is the most exciting new development in materials science?

I would say the most exciting thing isn’t necessarily a development, but more of a tide change. Early on, materials science was very much semiconductor-based, because semiconductors were novel. Then the field moved on to studying a select class of very high quality semiconductors very intensely. But now we know nearly everything about them, so recently 2-D materials have been a big thing, and we are entering a stage where there is a strong appetite for a large variety of materials. The field is maturing to be more interdisciplinary, and the tide is shifting towards a search for more exotic, highly crystalline materials. The momentum has grown a lot in recent years, and there are a lot of new computational tools that help predict new materials instead of just doing blind experiments. Theorists are using machine learning and large data sets to make these predictions, and it enriches my half of the problem.

What does your typical day look like? What do you do in your spare time?

On a typical day I go to lab at around 8 am, I talk with my grad student and we discuss our goals for the day. Then we try to focus on setting up equipment. I teach twice a week, and spend a lot of time planning and thinking about my lectures. My days are sometimes pretty long and difficult, and physically challenging as well, but it’s nice to see something set up at the end of the day.

In my free time I like to garden. I like it because it takes me 100% away from thinking about anything academic or sophisticated. I also like the idea that it’s like multitasking. When I’m at work, the plants are also hard at work growing.

My wife and I recently bought a house, and there are a lot of things we need to fix up, so at the moment I spend all my weekends sanding or painting. But it’s fun to be free to build stuff how you want it.

What is it like to travel the world as a scientist?

Attending international conferences and talking to people has resulted in a lot of collaboration, friendship, and support. Science is a very natural way of bridging boundaries.

Obviously, it is very different from a vacation as you are traveling to work, but I do think that it’s incumbent on scientists to enjoy their time while traveling. I think it’s one of the major perks of being a scientist.

I try to travel internationally as much as possible, though I will admit I am enjoying this one year where I don’t have to travel. In the year before I came to Caltech, I would fly from Europe to Caltech for a meeting, and then fly home the next day. There were some hard days and lots of 15-hour flights, so I am enjoying not traveling this year.

What do you love most about Caltech?

I like that science is the number one priority. Caltech places a lot of trust in its students and staff, including young people like me, and I enjoy the independence. The weather, and the campus are also very nice; it’s a beautiful place. There’s something special about rolling up to Caltech and going to the Athenaeum; it’s a very unique experience, very dignified.

I am very passionate about undergraduate mentoring, and I like that undergrads are strongly encouraged to participate in research.

What advice would you give to your younger self/Caltech undergrads?

Reflecting on my undergrad time, I didn’t have a designated advisor. I was very naive and was very lucky that everything worked out. I wish that I had forged a relationship with a mentor when I was younger and had someone to guide me. Luckily, I did have a mentor in graduate school, and that’s when my career really started to take shape. I would recommend you take full advantage of having an advisor. Especially going into your 20s, you think you know what’s going on, but looking back now, I had no idea. You should hunt for a good mentor, it should be someone who can advise you in research and courses, who is also in tune with your personal needs. A good mentor is a good human.

I would suggest you get involved with a lab as heavily as you can, and that you should think about lab work not as something tangential to your studies, but as your life education. It’s what they’re not going to teach you in class. I think you should take the most difficult classes possible in the first three years, research whenever you can, and then do a senior thesis. I personally didn’t do a senior thesis, and I 100% regret it. I have plans for every undergraduate in my lab to do a senior thesis. I strongly encourage it, because that’s what is going to be the most useful when applying to grad school.

Above all, though, you’ve got to have fun and challenge yourself.

Elucidating Catalysis with the “Gold Standard” of Quantum Chemistry

Author: Patryk Kozlowski
Mentor: Garnet K. Chan and Yang Gao
Editor: Hannah X. Chen


Due to the imminent crisis of climate change, developing sustainable methods for fuel and chemical production has become more important than ever. A significant step toward progress comes from electrochemical processes that use renewable energy to convert molecules from the atmosphere into higher-value products. Catalysts are instrumental in such processes, increasing the rate, efficiency, and selectivity of the chemical transformations involved1.

One notable example is iron, which catalyzes the conversion of atmospheric nitrogen into ammonia (used in artificial fertilizers) in the Haber-Bosch process. In fact, the 2007 Nobel Prize in Chemistry was awarded to physicist Gerhard Ertl “for his studies of chemical processes on solid surfaces” in relation to the Haber-Bosch process2. Clearly, understanding the chemistry occurring at catalyst surfaces will be crucial for the future development of enhanced catalysts. In particular, we need to be able to accurately compute surface energies – quantities that govern the chemistry at catalyst surfaces.

The Catalytic Mechanism

Compared to atoms that are part of the bulk material, atoms on a surface are unstable due to their relative lack of stabilizing intermolecular attractions (Fig. 1a). This inherent instability can cause surface atoms of a catalyst to interact with nearby gas molecules (Fig. 1b). Once adsorbed to the catalyst surface, reactants undergo a desired chemical transformation and are then desorbed from the surface, at which point the job of the catalyst is complete.

Fig. 1: (a) Schematic of a solid, demonstrating surface instability in relation to the bulk3. (b) Graphic depiction of a molecule adsorbing onto a catalyst surface4.

Theoretical Approaches

The surface energy measures the instability of a surface relative to the bulk material. Since surface energies give insight into the adsorption process, we would like to be able to compute them accurately from theory. Though density functional theory (DFT), the current workhorse computational method of surface science, has proven inadequate to this end, wavefunction-based methods that treat correlations between electrons have shown promise as a viable alternative5.

Coupled-cluster theory (CC), established as the “gold standard” of quantum chemistry at the molecular level, is one such framework, and it has been used to study materials6–10. CCSD (coupled-cluster theory including single and double excitations) has shown increased accuracy with respect to DFT in computing the properties of various materials6,11.

However, CCSD has only recently been used to study surface chemistry, which is relevant to catalysis. In this work, I compare the performance of the simpler perturbative MP2 method and the more robust CCSD method with that of DFT for computing energies of the (111) surface of platinum, a common catalyst.

Computational Methodology

Achieving a Gapped Platinum System

Since platinum is a gapless metal, its highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) are degenerate in energy. This degeneracy was problematic for the perturbative frameworks I wanted to use, but one can numerically introduce a twisted periodic boundary condition to artificially “open a gap” in the platinum system. Thus, in my simulations I sampled the reciprocal space of platinum’s primitive cell to identify the k-point mesh yielding the largest gap, and I used this mesh in all subsequent calculations.

Lattice Spacing Optimization

A basis set is a set of functions representing the electronic wavefunction so that the equations of the model can be solved efficiently on a computer. Using two basis sets, we optimized the bulk platinum lattice spacing computationally, obtaining fair agreement with the benchmark experimental value14 (Table 1). The minimal gth-szv basis set overestimated the bulk platinum lattice constant, while the more robust gth-dzvp basis set underestimated it.

Table 1: Platinum lattice constants (Å) generated from an equation of state fit on seven points in lattice constant-energy space computed with both the gth-szv and gth-dzvp basis sets at the DFT (done with four functionals) and Hartree-Fock (HF) levels of theory.

Computing Surface Energies

We then calculated platinum (111) surface energies according to Equation 1, where Ebulk and Eslab are, respectively, the bulk platinum and platinum surface slab energies, computed using the bulk-optimized lattice constants of Table 1, and N is the size of the simulated system.
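Equation 1 itself is not reproduced in the text; the conventional definition of the surface energy, consistent with the quantities named above (with A the surface area of the slab and the factor of two accounting for the slab’s two surfaces), reads:

```latex
\gamma = \frac{1}{2A}\left( E_{\text{slab}} - N\,E_{\text{bulk}} \right)
```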

Results and Discussion

Finite Size Effects

Material surfaces are effectively infinite in size. To compute their properties, we must make two finite-size approximations: both the k-point mesh and the number of layers in the surface slab, perpendicular to the surface, must be finite. We then assess the effects of this finite size by independently varying the number of k-points and surface slab layers sampled in calculations of platinum (111) surface energies.

As expected, we observed that surface energies steadily approach a thermodynamic limit value as both more k-points (Fig. 4ab, 6) and more slab layers (Fig. 5ab) are sampled. We see this from the observed oscillatory convergence in the former case and monotonic convergence in the latter. Furthermore, as expected, a comparison of Figures 4a and 4b shows that surface energies approached in the thermodynamic limit by the more robust gth-dzvp basis are more accurate than those approached by the minimal gth-szv basis.

Figure 4: Surface energies of platinum (111) slabs with 4 layers, computed with DFT at four functionals (Table 1) at the (a) gth-szv and (b) gth-dzvp basis sets, plotted as a function of the number of k-points used. The black line at 2.975 J/m2 represents a benchmark experimental value15.
Figure 5: Percent change in the surface energy of platinum (111) slabs from (a) 3-layer slab surface energies, computed with DFT at four functionals (Table 1 lattice constants, gth-szv basis, 1 k-point) and (b) 4-layer slab surface energies computed with MP2 and CCSD, (Table 1 HF lattice constants, gth-szv/gth-dzvp basis sets, 1 k-point), plotted as a function of the number of slab layers considered.

DFT Performance

It is worth mentioning that we corroborated a trend observed in the literature: DFT’s systematic overestimation of surface stability with respect to experiment5. That is, Fig. 4ab show that DFT consistently yields surface energies below the experimental value when approaching the thermodynamic limit of more k-points sampled.

MP2 Performance

The perturbative MP2 method struggled with the remaining near-degeneracy of platinum’s HOMO and LUMO discussed previously. This is because the 2nd-order perturbative energy correction, the component of the MP2 energy that aims to capture correlations between electrons, contains the difference between molecular orbital energies εi, εj, εa, and εb in its denominator (Equation 2).
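The correction referenced as Equation 2 takes the standard MP2 form (written here in spin orbitals, with i, j occupied and a, b virtual orbitals):

```latex
E^{(2)} = \frac{1}{4} \sum_{ijab}
\frac{\left| \langle ij \| ab \rangle \right|^{2}}
{\varepsilon_{i} + \varepsilon_{j} - \varepsilon_{a} - \varepsilon_{b}}
```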

The near energetic degeneracy of the HOMO and LUMO makes this denominator go to zero, causing the MP2 energy to go to infinity in a physically unrealistic manner.

We speculate that this degeneracy caused MP2’s poor surface energies relative to DFT, evidenced by a comparison of the results from Fig. 6 and 4b, respectively, at the robust basis set (gth-dzvp) and the largest comparable reciprocal space sampling (9 k-points).

Figure 6: Surface energies of platinum (111) slabs, computed with MP2 (Table 1 HF lattice constants, gth-szv/gth-dzvp basis sets), plotted as a function of the number of k-points used. At data points with 1 or 4 k-points, 4-layer slabs were used; at data points with 3 k-points, 3-layer slabs were used. The black line at 2.975 J/m2 represents a benchmark experimental value15.

CCSD Performance

The CCSD method also performed poorly in computing platinum surface energies. Even assessing the accuracy of CCSD surface energies was hindered by difficulty getting the calculations, which rely on an iterative solver, to converge numerically. CCSD, which treats correlations between electrons through the exponential CC ansatz, struggled to converge because of the same remaining near-degeneracy of platinum’s HOMO and LUMO.

We obtained CCSD convergence only with a minimal 1 k-point sampling, though results from such a small sampling are untrustworthy, yielding surface energies more than 10 J/m² off the experimental value (Fig. 5, 7). Additionally, the CCSD method has a steep computational cost scaling of O(N⁶), where N is the size of the simulated system. Thus, computations at the robust gth-dzvp basis set and the modest reciprocal-space samplings (16 k-points or more) required for accurate platinum surface energies proved prohibitively expensive.


Conclusions

This study confirmed the inadequacy of DFT for computing accurate surface energies, as identified in the literature. Furthermore, we found that even for a maximally gapped platinum system, the MP2 and CCSD frameworks are unable to compute accurate surface energies. Platinum’s metallic character causes the former method to yield surface energies less accurate than DFT and plagues the latter with numerical convergence issues. Ultimately, these results show the ineffectiveness of perturbative methods for computing surface energies of metals like platinum.

A natural direction for future work will be to assess the performance of perturbative methods, including CCSD, for computing surface energies of non-metal catalysts, in which sufficiently large HOMO/LUMO gaps exist. We would also seek to compute other quantities that help elucidate catalytic mechanisms, such as the adsorption energy.


Acknowledgements

I would like to thank Yang Gao for guiding me through my project on a daily basis in this difficult remote setting and Professor Chan for finding time to discuss my project each week throughout the summer. Lastly, I would like to sincerely thank the John Stauffer Charitable Trust for funding my SURF this summer, enabling me to get deeply involved in research I am passionate about.


References

  1. Seh, Z. W.; Kibsgaard, J.; Dickens, C. F.; Chorkendorff, I.; Nørskov, J. K.; Jaramillo, T. F. Combining Theory and Experiment in Electrocatalysis: Insights into Materials Design. Science 2017, 355 (6321).
  2. The Nobel Prize in Chemistry 2007 (accessed Feb 19, 2020).
  3. What is Surface Energy? Calculation Models and More Explained (accessed Sep 24, 2020).
  4. Adsorption Energy & Surface Energy | Materials Square Tech Blog.
  5. Schimka, L.; Harl, J.; Stroppa, A.; Grüneis, A.; Marsman, M.; Mittendorfer, F.; Kresse, G. Accurate Surface and Adsorption Energies from Many-Body Perturbation Theory. Nature Materials 2010, 9 (9), 741–744.
  6. McClain, J.; Sun, Q.; Chan, G. K.-L.; Berkelbach, T. C. Gaussian-Based Coupled-Cluster Theory for the Ground-State and Band Structure of Solids. J. Chem. Theory Comput. 2017, 13 (3), 1209–1218.
  7. Booth, G. H.; Grüneis, A.; Kresse, G.; Alavi, A. Towards an Exact Description of Electronic Wavefunctions in Real Solids. Nature 2013, 493 (7432), 365–370.
  8. Liao, K.; Grüneis, A. Communication: Finite Size Correction in Periodic Coupled Cluster Theory Calculations of Solids. J. Chem. Phys. 2016, 145 (14), 141102.
  9. Gruber, T.; Liao, K.; Tsatsoulis, T.; Hummel, F.; Grüneis, A. Applying the Coupled-Cluster Ansatz to Solids and Surfaces in the Thermodynamic Limit. Phys. Rev. X 2018, 8 (2), 021043.
  10. Wang, X.; Berkelbach, T. C. Excitons in Solids from Periodic Equation-of-Motion Coupled-Cluster Theory. J. Chem. Theory Comput. 2020, 16 (5), 3095–3103.
  11. Gao, Y.; Sun, Q.; Yu, J. M.; Motta, M.; McClain, J.; White, A. F.; Minnich, A. J.; Chan, G. K.-L. Electronic Structure of Bulk Manganese Oxide and Nickel Oxide from Coupled Cluster Theory. Phys. Rev. B 2020, 101 (16), 165138.
  12. Shepherd, J. J.; Grüneis, A. Many-Body Quantum Chemistry for the Electron Gas: Convergent Perturbative Theories. Phys. Rev. Lett. 2013, 110 (22), 226401.
  13. The first Brillouin zone of a face centered cubic lattice (accessed Oct 23, 2020).
  14. Properties: Platinum – Properties and Applications (accessed Jul 8, 2020).
  15. Boer, F. R.; Boom, R.; Mattens, W. C. M.; Miedema, A. R.; Niessen, A. K. Cohesion in Metals: Transition Metal Alloys; North-Holland, 1988.

Further Reading

[1] A brief conceptual introduction to catalysis can be found here.

[2] An introduction to basic electronic structure theory of materials can be found in Chapters 1-4 of this reference.

[3] An introduction to coupled-cluster theory and a review of its recent application to materials science is found here.

Are All Fast Radio Bursts the Same?

Author: Lucca S. de Mello
Mentor: Charles L. Steinhardt
Editor: Laura Lewis, Caltech, Cosmic Dawn Center

Abstract: In 2007, astronomers discovered highly energetic, millisecond-long radio waves of extragalactic origin. These Fast Radio Bursts (FRBs) are poorly understood, and there is no definitive explanation for their origin yet. Using a novel machine learning technique, we have made significant progress toward determining whether FRBs can be separated into discrete subclasses, or if all FRBs are instances of the same astrophysical phenomenon. The technique consists of modeling the entire intensity-versus-time curve of these bursts as vectors in a high-dimensional space and clustering these vectors in two dimensions. We expect that grouping FRBs by their physical properties will also group them by their physical origin.


With advances in telescope technology, humanity has recently acquired the ability to observe the sky more closely, noticing fascinating changes that occur within the span of a millisecond. In 2007, a new class of astronomical object was discovered: fast radio bursts, or FRBs (Fig. 1). FRBs are highly energetic1 pulses in the radio spectrum that last for approximately two milliseconds at most. Due to their very recent discovery, little is known about them. Thus, the purpose of this project is to explore this field by further investigating the origin and properties of FRBs.

The high energy and short duration of FRBs make them an especially intriguing class of astronomical object. The special theory of relativity asserts that causality travels at the speed of light. So, in the span of a millisecond, causality travels approximately 300 km. This implies that the diameter of the source of any signal with a duration of one or two milliseconds – the typical duration of an FRB – is on the order of 300 km or smaller, otherwise a complex (and thus unlikely) synchronization mechanism would be required for the bursts to be so short. Relative to distances on an astronomical scale, the order of hundreds of kilometers is fairly small.

Figure 1: Waterfall plot and light curve of the Lorimer burst2. Named after the astronomer who, in 2007, found it in archived data from 2001, the Lorimer burst was the first fast radio burst to be discovered. The two-dimensional grid plot displays the measured intensity (darker is more intense) over time (in seconds) for each separate frequency channel (in MHz). This is known as a “waterfall plot”. Observe the streak of high intensity right before the 1727-second mark, starting in the higher radio frequencies and moving to the lower frequencies over time. That is FRB010724, the Lorimer burst. By integrating the intensity over all frequency channels, we also get the total intensity-versus-time plot here in the upper right corner. That curve is known as the “light curve” of the burst.

It is also known that this mysterious source, although small, must be associated with some extremely energetic astrophysical process1. We know this because, in spite of their short duration and the cosmological distances they travel before reaching us, FRBs are as bright in the radio frequencies as a galaxy when viewed from Earth [1].

FRB science is a nascent but thriving field of astronomy. Although the dataset is still limited, the host of instruments scattered throughout the world listening for new FRBs will give us an accelerating influx of data, which can help us further understand the nature of FRBs. We have already observed that they originate from the entire sky, not just from directions coplanar with the Milky Way3, implying that most of them are of extragalactic origin4. Additionally, likely candidates for what could be the source of FRBs have been identified; these include neutron stars [2] and black holes. However, due to the very recent discovery of FRBs, there is currently no definitive explanation for what astronomical object is the source of FRBs, and we are currently far away from truly understanding what they are.

FRBs are not the only kind of short-duration, high-energy burst of radiation that has been observed. There are also the gamma-ray bursts (GRBs), which are longer and brighter than FRBs. GRBs have been studied for far longer – since the late 1960s – so they are better understood. Indeed, one important milestone in GRB study has been their separation into two discrete classes, “short” and “long”5. The realization that not all GRBs are the same was important because, by studying these classes separately, we were able to connect each of them with an astrophysical origin: we now know that “long” GRBs are generated from the collapse of massive stars, while “short” GRBs come from neutron stars6. This naturally raises the question of whether or not the situation for FRBs is similar to that of GRBs. Are all fast radio bursts the same?


Goals

Despite the important advancements that have been made in the process of understanding FRBs, the key step of classification has not yet been realized. As a result, it has not been possible to connect FRBs to an astrophysical origin – or origins. So the problem we chose to tackle in this research project was that of classifying FRBs.

To this end, we have applied a novel machine-learning technique known as “dimensionality reduction,” which has recently proven successful in solving the long-standing classification problem for gamma-ray bursts (GRBs)7. It was known that gamma-ray bursts appear to have two types – called “short” and “long”, as previously discussed – but it had not been possible to unambiguously separate these two types of objects. For example, some “short” GRBs are actually longer in duration than some “long” GRBs.

Earlier attempts at classifying GRBs took the light curve – the curve of intensity [3] versus time, integrated over all frequency channels – and reduced it to a small number of summary statistics (such as the duration of the GRB) that lend themselves easily to physical interpretation. However, this always resulted in some ambiguous objects, which might have been either short or long, as Fig. 2 exemplifies.

Figure 2: Plot of duration versus spectral hardness for several gamma-ray bursts12. Spectral hardness can be interpreted as a measure of the peak intensity of the burst13. As this figure shows, plotting GRBs by these two summary statistics results in two general clusters, but the separation is ambiguous. Specifically, there is a large number of objects between the two clusters, so this technique is not powerful enough to tell us which subclass contains each ambiguous GRB. It is still a very useful result, as studying the general clusters has led to the identification of the probable sources for each of them. However, a definitive, unambiguous classification could lead to better answers.

To solve this problem, instead of only using two measurements, all of the data were used. The challenge with this is what is known as “the curse of dimensionality”9. Suppose we want to plot objects in N dimensions, where N is the number of measurements. As N increases, so does the sparsity of the data points, meaning that all objects appear to be dissimilar in many ways. The solution was to use what is known as a dimensionality-reduction algorithm. As the name suggests, dimensionality-reduction algorithms take in a set of high-dimensional vectors that model some dataset, then cluster those vectors in a lower-dimensional space – usually in two dimensions – grouping them in a way that preserves the structure of the high-dimensional data. This means that, in the low-dimensional output space, similar objects stay close and dissimilar objects stay distant. This idea is illustrated in Fig. 3.

Figure 3: Illustration of what a dimensionality-reduction algorithm does. In this example, images of hand-drawn digits were processed into high-dimensional vectors in some way that encodes their structure. A dimensionality-reduction algorithm was then used to cluster these vectors in two dimensions, with the aim that similar digits stay close, dissimilar digits stay distant, and the classification is unambiguous. As can be observed, images are placed into the same cluster if and only if they represent the same digit. Further, the classification of digits that emerges is unambiguous. Note also that the distance between clusters reflects the dissimilarity between the corresponding digits.

The result was Fig. 4, which shows two discrete categories of GRBs, with every object unambiguously classified as either short (orange) or long (purple). Since the problem of classifying GRBs appears to be similar to the problem of classifying FRBs, the specific goal of this project was to make progress on the latter by applying a promising technique that worked on the former. That is, our ultimate goal was to get something similar to Fig. 4, but with FRBs instead of GRBs [4]. Failure in doing so would indicate that either:

  1. FRBs are, in essence, all instances of the same astrophysical phenomenon.
  2. FRB subclasses do exist but may not be easily revealed by dimensionality reduction alone.
  3. Unlike GRBs, FRBs can only be classified in a continuous spectrum.
Figure 4: Result of applying a dimensionality-reduction algorithm to the GRB dataset7. Two discrete categories of GRBs emerged, with every single object unambiguously either short (orange) or long (purple).


As previously discussed, clustering the FRBs by representing each burst as a collection of summary statistics has not been a successful approach. So, instead, our technique consists of clustering the FRBs by the “similarity” of their light curves. For example, some FRBs have light curves that end just as abruptly as they start, while others have an exponential fallout tail after their maximum, as if the burst “rings” for a period of time after peaking. Another component of this intuitive notion of similarity is that we are only concerned with the shape of the light curve; that is, if a light curve is simply a scaled-up version of another, we consider them to be similar, since the scaled-up (but otherwise similarly-shaped) light curve is simply a more intense version of the other. Intensity should not be a factor because FRBs that appear more intense may simply be closer to us, so letting the bursts’ intensity affect their classification could lead to the FRBs being grouped by their sources’ proximity to Earth. Proximity is neither an intrinsic property of the FRB nor of its source, so such a clustering would be physically meaningless. For more details on the similarity of FRB light curves, see Fig. 5.

Figure 5: Waterfall plots and light curves of several FRBs8. Notice how some FRBs end abruptly (e.g. FRB170107, top left) while others end with a fallout tail (e.g. FRB180110, row 3, column 4). Some FRBs even have a secondary “mini-burst” after their peak (e.g. FRB180119, row 3, column 5). Some (or all) of these distinctive topological features may be important in the classification of FRBs. Comparing the vertical axes of these light curves, one also notices that some light curves are simply scaled-up versions of another, even if by a large factor. Since these factors are a proxy for different distances, the light curves must be normalized by peak intensity before our algorithm can be applied.

Quantitatively, we can encode the similarity of these light curves by processing the light curves into vectors in a way that:

  1. Uses the raw light-curve data instead of summary statistics, making no assumptions on how strongly each physical measurement of the FRBs should influence the classification. Consequently, the clustering process is dictated by the raw data and nothing else.
  2. Encodes the entire light curve. In practice, this means that it must be possible to reconstruct the light curve of an FRB from the vector representing it. In other words, the process of transforming FRBs into vectors must be reversible. As a result, our approach has perfect information on the light curves being clustered.
  3. Ensures that the distance between two vectors is indeed a measure of dissimilarity in topology, or shape, of the corresponding light curves.
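A minimal sketch of an encoding that meets these requirements is to normalize each light curve by its peak intensity and resample it onto a common time grid; the function name, grid length, and synthetic burst below are illustrative assumptions, not the project’s exact pipeline.

```python
import numpy as np

def light_curve_to_vector(times, intensities, n_points=256):
    """Encode a light curve as a fixed-length vector.

    Dividing by the peak intensity removes distance-induced brightness
    differences, so vector distance reflects shape alone; resampling
    onto a common grid puts every burst in the same n_points-dimensional
    space, and the curve can be approximately reconstructed from the
    vector. (Illustrative sketch, not the authors' exact code.)
    """
    times = np.asarray(times, dtype=float)
    intensities = np.asarray(intensities, dtype=float)
    intensities = intensities / intensities.max()   # peak becomes 1.0
    grid = np.linspace(times[0], times[-1], n_points)
    return np.interp(grid, times, intensities)      # uniform resampling

# Synthetic example: a 40 Jy burst with an exponential fallout tail.
t = np.linspace(0.0, 0.01, 500)                      # 10 ms recording
burst = 40.0 * np.exp(-((t - 0.002) / 0.0005) ** 2)  # sharp rise to peak
tail = t > 0.002
burst[tail] = 40.0 * np.exp(-(t[tail] - 0.002) / 0.001)
vec = light_curve_to_vector(t, burst)                # shape (256,), peak ~1
```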

Since each vector stores all the information we need to reconstruct the corresponding light curve, the vectors are in a very high-dimensional space. Vectors are something we can naturally cluster, so it would stand to reason that the next step is to simply apply any clustering algorithm to the result. However, as discussed in the previous section (Goals), this would not give satisfactory results because of the “curse of dimensionality,” so we must apply a dimensionality-reduction algorithm.

The dimensionality-reduction algorithm we chose is the t-distributed stochastic neighbor embedding, or t-SNE. t-SNE assigns each high-dimensional data point to a temporary position in the low-dimensional space, called a “map point” [5], and gradually updates these positions in a way that attracts similar objects to each other and repels the objects that are dissimilar.

One physical analogy for how this is accomplished is as follows. First, the map points are connected to each other by springs. The smaller the distance between two high-dimensional data points, the stiffer the spring connecting their map points. As a result, map points are attracted if they are far apart but their corresponding objects are close. If two map points are close while their corresponding objects are dissimilar, they are repelled. Then, this system evolves until it reaches equilibrium. The equilibrium state associates each data point to its position in the low-dimensional space. For an illustration of what would result from this, refer again to Fig. 3 as the dimensionality-reduction algorithm used to compute that example was, in fact, t-SNE.

The example of classifying hand-drawn digits also illustrates how t-SNE preserves the macroscopic structure of the objects being clustered: fours look like nines, so those two clusters end up together; nines look like sevens, which look like ones, and so on. This structure-preserving property is essential for our goal of clustering FRB light curves by topological similarity, which is why we chose t-SNE for this project. Additionally, the previously discussed success in unambiguously clustering GRBs (Fig. 4) was also accomplished using t-SNE, further motivating our choice of dimensionality-reduction algorithm.
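Concretely, the clustering step can be sketched with scikit-learn’s TSNE. The two synthetic burst families, the noise level, and every parameter below are illustrative assumptions standing in for the project’s actual data and settings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: 21 light-curve vectors (one per FRB), already
# normalized by peak intensity and resampled to 256 samples.
# Two synthetic shape families mimic the bursts described above:
# abrupt endings versus exponential fallout tails.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 256)
abrupt = np.exp(-((grid - 0.3) / 0.02) ** 2)        # ends abruptly
with_tail = np.where(grid < 0.3, abrupt,
                     np.exp(-(grid - 0.3) / 0.1))   # "rings" after peak
X = np.vstack([
    abrupt + 0.01 * rng.standard_normal((7, grid.size)),
    with_tail + 0.01 * rng.standard_normal((14, grid.size)),
])

# Embed the 256-dimensional vectors in two dimensions. Perplexity
# must be smaller than the number of samples; with only 21 bursts
# a small value is appropriate.
embedding = TSNE(n_components=2, perplexity=5.0,
                 random_state=0).fit_transform(X)
# embedding holds one 2-D map point per burst: shape (21, 2)
```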


After fitting t-SNE to the processed, high-dimensional data and calibrating the algorithm’s parameters, we have obtained Fig. 6 as our program’s final output. As our figure illustrates, we have separated twenty-one FRBs into two groups: one group containing seven FRBs and the other containing fourteen. Recall that our goal was to obtain something similar to Fig. 4, but with FRBs instead of GRBs. In Fig. 4, the physical interpretation of the two subclasses is known: the orange cluster represents “short” GRBs, and the purple cluster represents “long” GRBs. Indeed, the short/long separation of GRBs was known before t-SNE was applied to GRBs (Fig. 2). Unlike GRBs, however, FRBs are still too shrouded in mystery for us to make a probable physical interpretation of the clusters in our result.

Figure 6: The twenty-one FRBs whose raw data we had access to, clustered into two groups. The axes resulting from t-SNE have no physical interpretation or units; only the structure is meaningful.

There is a significant variation in the measured intensity of FRBs. As previously discussed, while an FRB’s higher measured intensity may be an intrinsic property of the burst, it may also result from that FRB having originated closer to Earth. So, to ensure that our program is not biased by FRB intensity – that is, to show that we are not clustering FRBs by their proximity to Earth – we have also produced Fig. 7-8, in which FRBs are colored by peak intensity.

If our algorithm were biased by intensity, the resulting plot would have been, at least approximately, a spectrum from the “low-intensity” colors to the “high-intensity” colors. Instead, these figures show a well-mixed distribution of FRBs, with both clusters containing bursts in the mid- and low-intensity ranges. The very few bursts in the high-intensity range do fall in the same cluster, but the previous observation – coupled with the fact that the light curves were normalized by peak intensity – leads us to believe that this is most likely not indicative of bias. Of course, a larger sample size is necessary for any definitive assertions on bias, or lack thereof, to be made.

Our program and the clustering we have obtained from it are both preliminary, but they are the first of their kind. Crucially, the quality of the classifications it produces will only increase with the amount of data it is given. With the global community of radio astronomers currently listening for FRBs, the amount of available FRB data is certain to grow in an accelerating fashion. We believe that, if our program is adapted to make full use of this progressively larger volume of data, it will be a significantly useful tool in solving the puzzle of FRBs.

Figure 7: Fig. 6 but colored by peak intensity, capped at 3 Jy. The only FRB in this plot whose peak intensity exceeds 3 Jy is FRB150807 (bottom right). It is the most luminous FRB ever recorded, peaking at 128 Jy, and is the reason why this plot’s coloring is capped.
Figure 8: Fig. 7 without the intensity cap, together with the light curve of FRB150807. Time is in seconds and intensity is in Jy. All FRBs appear to have the same peak intensity when plotted next to FRB150807 due to the latter’s abnormally high peak intensity.


We have developed a program that, given a standardized set of raw FRB data, clusters the FRBs by the similarity of their light-curves. Our program is dictated by the raw data alone, making no assumptions on how strongly each physical measurement of the FRBs should influence the classification. It has perfect information on each light curve being clustered, and everything indicates that it is not biased by each source’s proximity to Earth. Our program employs t-SNE, an algorithm that was successful in a problem very similar to that of classifying FRBs. The motivation for doing so is that we expect that grouping FRBs by their physical properties will also group them by their physical origin.

Our preliminary results suggest that our program for processing and grouping FRB light-curves is a promising candidate solution to the challenge of classifying FRBs. Further, they suggest that more definitive classification results obtained by running our program on a larger dataset will likely be unbiased with respect to proximity.

While our result does represent a necessary first step in the challenge of classifying FRBs, the problem has not been solved. Due to several technical factors related to the very recent discovery of FRBs, we only had access to a small sample size. One of the most difficult parts of studying FRBs is that the raw data is not easily accessible, as there is no publicly available catalog of raw FRB data yet [6]. Additionally, raw FRB data is stored in a large variety of file formats; if there were a standard for storing raw FRB data, classification studies would likely advance much faster. For these two reasons, conclusive further research in this area will likely require the creation of a catalog of raw FRB data that uses a standardized file format [7].

Moreover, the algorithm developed in this project can only produce meaningful results if the dataset uses the same sampling period across all FRB recordings. This means that most datasets are not compatible with each other; some datasets, such as the one used in this project, are not even consistent within themselves. And, as previously discussed, FRB data is still limited. So, to obtain more definitive classification results through our algorithm, further research may have to wait until there is a larger collection of FRB observations with practically identical time resolutions.

Are all fast radio bursts the same? We do not know, and it may still take time until it is possible to answer this question. But we have developed a tool that can be very useful in doing so; all we need is more data. We hope that, with a large, standardized, and easily accessible catalog of raw FRB data, the use of machine learning in the classification of FRBs will prove to be a powerful way to better understand the nature of these mysterious objects.


This research project is a part of, and was made possible by, the SURF program. I thank the Caltech Student-Faculty Office for their support, and I thank my mentor, Charles L. Steinhardt, for his help in executing this project. I thank Duncan Lorimer for helping us obtain raw FRB data.


The data and code used in this research can be found in this GitHub repository.

For this project, twenty-seven plaintext files of raw FRB light-curve data were acquired with the help of Dr. Duncan Lorimer. So that the light curves can be meaningfully compared, the twenty-one files (of those twenty-seven) whose data used the most common sampling period (0.97 ms) were selected. These twenty-one FRBs are in the “data” directory of the linked repository; the other six are in the “bad data” directory, which is not used by the program. The table below lists the twenty-seven FRBs used in this project, ordered by when they occurred [8]. The first twenty-one FRBs in the table were recorded with practically the same sampling period and can therefore be meaningfully compared using our program. Accordingly, these twenty-one FRBs are the ones shown in Fig. 6-8.
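The selection step above can be sketched as follows. This is a hedged illustration, not the project’s actual code; the `frbs` mapping of burst names to sampling periods is a hypothetical input format.

```python
from collections import Counter

def select_comparable(frbs):
    """Keep only the FRBs recorded with the most common sampling period.

    frbs: dict mapping FRB name -> sampling period in ms.
    Returns the (sorted) names whose period equals the mode, so their
    light curves share the same time resolution and can be compared
    bin-by-bin.
    """
    mode_period, _ = Counter(frbs.values()).most_common(1)[0]
    return sorted(name for name, period in frbs.items()
                  if period == mode_period)
```

Applied to the project’s twenty-seven files, this kind of filter would retain the twenty-one recorded at 0.97 ms and discard the other six.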

FRB    Sampling Period (ms)


  1. L. Billings. A Brilliant Flash, Then Nothing: New Fast Radio Bursts Mystify Astronomers (2013).
  2. D. Lorimer. A bright millisecond radio burst of extragalactic origin (2007).
  3. M. Caleb et al. The first interferometric detections of Fast Radio Bursts (2017).
  4. K. Masui et al. Dense magnetized plasma associated with a fast radio burst (2015).
  5. C. Kouveliotou et al. Identification of Two Classes of Gamma-Ray Bursts (1993).
  6. E. Nakar. Short-Hard Gamma-Ray Bursts (2007).
  7. C. Jespersen et al. An Unambiguous Separation of Gamma-Ray Bursts into Two Classes from Prompt Emission Alone (2020).
  8. R. M. Shannon et al. The dispersion-brightness relation for fast radio bursts from a wide-field survey (2018).
  9. R. E. Bellman. Adaptive control processes: a guided tour (1961).
  10. D. Michilli et al. An extreme magneto-ionic environment associated with the fast radio burst source FRB 121102 (2018).
  11. C. D. Bochenek et al. STARE2: Detecting Fast Radio Bursts in the Milky Way (2020).
  12. J. P. U. Fynbo et al. The optical afterglow of the short gamma-ray burst GRB 050709 (2005).
  13. A. Shahmoradi. Hardness as a spectral peak estimator for gamma-ray bursts (2010).
  14. F. Pedregosa et al. Scikit-learn: Machine Learning in Python (2011).


[1] Fig. 5 of On the Normalised FRB Luminosity Function shows the luminosity function of FRBs, i.e. the number of FRBs per luminosity interval. Comparing it to the luminosity function of galaxies shows that, in general, the luminosity of FRBs is indeed similar to that of galaxies.

[2] Some FRBs are polarized, which indicates that at least those FRBs were emitted from a specific kind of neutron star called a magnetar10. Indeed, the first FRB to have been linked to a known source, FRB200428, was found to originate from a magnetar11.

[3] By “intensity,” we mean Spectral flux density, measured in Janskys (Jy).

[4] Though we did not assume that the number of FRB subclasses, if they exist, is only two.

[5] Map points are computed using probabilistic methods beyond the scope of this paper.

[6] There is a publicly available catalog of FRB data, but it does not store raw FRB data; it is limited to summary statistics. While these are useful in general, using summary statistics has not been a fruitful strategy in the challenge of classifying FRBs.

[7] Or, alternatively, the extension of the existing catalog into one such catalog.

[8] Not necessarily when they were discovered; many FRBs, including the first one ever found, are discovered from old archival data.

Study of Low Mass Long Lived Particles Decaying in the Muon System

Author: Gabrielle Dituri
Mentors: Maria Spiropulu, Christina Wang, Cristian Pena, and Si Xie
Editor: Stephanie Chen

The God Particle and Why it is Troublesome

The Higgs boson (nicknamed the “God Particle”) was discovered in 2012 at the Large Hadron Collider (LHC) by ATLAS and CMS1. The nickname arose from a humorous publishing decision: the author of a book about the Higgs shortened the original title (“the Goddamn Particle”) into one that would appeal to a wider audience (“the God Particle”). The original title comedically highlighted the difficulty of proving the particle’s existence, which took about 50 years and an expensive particle accelerator. Despite the shortened name, the God Particle nickname accurately emphasizes the importance of the Higgs discovery to modern physics. The discovery lent credence to the Standard Model (SM), which describes all the elementary particles and their interactions. The Higgs is an elementary particle predicted by the SM but not observed until 2012. Its discovery confirmed that the Standard Model can correctly predict the existence of a particle before any data is collected. This is extremely important to particle physics because it shows that unknown particles can be predicted before we have the technology to find physical proof of them. According to the Standard Model, the Higgs can decay into fermions and massive gauge bosons (Fig. 1).

Figure 1: The Standard Model2. This shows the Standard Model particles, including quarks and leptons.

The SM dictates that the Higgs boson decays into SM particles, including fermions: elementary particles with half-odd-integer spin (1/2, 3/2, etc.) that obey the Pauli Exclusion Principle. However, we are looking to provide credence to an alternative theory3 which suggests that the Higgs decays first into a pair of long-lived particles (LLPs). These particles then decay into SM particles, which are measured in the Cathode Strip Chamber (CSC), the endcap Muon System in CMS. The Higgs has a 20% probability of decaying into invisible particles, such as neutrinos or beyond-the-Standard-Model (BSM) particles. Therefore, there is a possibility that the Higgs decays to LLPs, which then decay back into detectable particles, like fermions4. LLPs are BSM particles, making them extremely important to study: they could help solve many open problems in physics, such as the matter/antimatter asymmetry problem5. My research aims to verify the theory that the Higgs decays into LLPs by looking for the signature of LLPs decaying back to SM particles in the CSC system. The LLP decays generally create a large cluster of RecHits (reconstructed hits in the detector) in the CSC, allowing us to study them accurately. Furthermore, we would like to understand whether the number of RecHits is related to the mass and decay mode of the LLPs, and how the distribution differs from that of background. The background consists mostly of particles such as pions and kaons produced at the interaction point (the point where the protons collide), rather than particles produced from displaced decays.

Methods and Results

For this study, we utilized 10 simulated samples (Fig. 3, col. 1). Here, k is a kaon, pi is a pion, d is a down quark, τ is a tau lepton, e is an electron, and b is a bottom quark. These particles were chosen because they represent a broad spectrum: charged and neutral particles, quarks and leptons, etc.

Acceptance vs Lifetime

We plotted the acceptance (the fraction of events with at least one LLP decay in the CSC) versus proper lifetime (Fig. 2). The larger the number of signal events, the more sensitive the detector is to a decay. From this graph, we can clearly see that the most sensitive lifetime depends strongly on the LLP mass and that the Muon System is sensitive to low-mass LLPs with shorter lifetimes.

Figure 2: Acceptance vs. Lifetime. We see that most particles that come from an LLP of mass 1 GeV have their acceptance peak at a proper lifetime of about 50-100 mm.
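Why acceptance peaks at an intermediate lifetime can be illustrated with a toy model (this is not the CMS simulation used in the study; the boost and the radial window standing in for the CSC volume are hypothetical numbers chosen only for illustration). An LLP with lab-frame mean decay length L decays inside a shell [r_min, r_max] with probability exp(-r_min/L) - exp(-r_max/L): too short a lifetime and it decays before reaching the Muon System, too long and it escapes.

```python
import math

def toy_acceptance(ctau_mm, boost=10.0, r_min=4000.0, r_max=7000.0):
    """Toy probability that an LLP decays inside a detector shell.

    ctau_mm: proper lifetime times c, in mm.
    boost:   assumed lab-frame Lorentz boost (hypothetical value).
    r_min, r_max: shell standing in for the CSC volume, in mm
                  (hypothetical values).
    Exponential decay law: P = exp(-r_min/L) - exp(-r_max/L),
    with mean lab-frame decay length L = boost * ctau.
    """
    L = boost * ctau_mm
    return math.exp(-r_min / L) - math.exp(-r_max / L)
```

Because L scales with the boost, a lighter LLP (larger boost at fixed energy) reaches the same shell with a shorter proper lifetime, consistent with the Muon System being most sensitive to low-mass LLPs at shorter lifetimes.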

Acceptance and Clustering Efficiency

We then calculated the acceptance for events pertaining to LLPs with differing lifetimes (Fig. 3). We also counted how many of those events had a cluster matched to the LLP, where a cluster has 50 or more RecHits; “matched” simply means the cluster was produced by a specific LLP. From these two percentages, we computed the Clustering Efficiency, defined as the number of events with a cluster matched to an LLP divided by the number of events with at least one LLP decay in the CSC. This is important because it tells us whether the formation of clusters is related to the LLP mass and to the type of particles it decays to.

Decay Particles | ms (GeV) | cτ (mm) | Acceptance (%) | Events with a Cluster Matched to LLP (%) | Clustering Efficiency (%)
ee     | 0.5 | 500  | 3.04  | 0.90  | 29.50
ee     | 1   | 100  | 15.65 | 5.00  | 31.94
ee     | 1   | 500  | 6.59  | 1.95  | 29.67
pi0pi0 | 1   | 100  | 15.65 | 5.00  | 31.98
pi0pi0 | 1   | 500  | 6.58  | 1.99  | 30.18
pi+pi  | 1   | 500  | 6.55  | 4.05  | 61.90
k+k    | 1.5 | 500  | 8.80  | 5.70  | 64.83
dd     | 1   | 1000 | 3.74  | 1.96  | 52.43
ττ     | 7   | 1000 | 14.15 | 6.45  | 45.58
bb     | 15  | 1000 | 16.35 | 11.94 | 73.04
Figure 3: Table of percentages for each decay particle. Although the Clustering Efficiency ranged from about 30%-73%, all Efficiencies are on the same order of magnitude. Thus, we saw that there was not an extreme difference between how many events formed clusters for all the samples.
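The last column of Fig. 3 follows directly from the other two percentages, since both are fractions of the same event sample. As a quick sanity check (a sketch, not the analysis code):

```python
def clustering_efficiency(acceptance_pct, matched_pct):
    """Clustering Efficiency as defined in the text: the percentage of
    events with at least one LLP decay in the CSC that also produced a
    cluster (>= 50 RecHits) matched to the LLP."""
    return 100.0 * matched_pct / acceptance_pct

# bb, 15 GeV, ctau = 1000 mm (values from Fig. 3):
clustering_efficiency(16.35, 11.94)  # ~73.0, matching the table
```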

Relationship Between Number of Events and Mass

We wanted to find the relationship between the number of events/hits and the lifetime and mass of the LLP the particles decayed from. This gave us insight into how sensitive the detector is to these factors. One way we did this was by analyzing the relationship between events and X Spread (how much the clusters are spread in the X direction of the detector), Y Spread (how much the clusters are spread in the Y direction of the detector), N Station (station in the detector where the hits are detected), Cluster Size (number of RecHits in the clusters), pseudorapidity of the cluster (which represents the angle of the particle relative to the z axis), and Avg Station (average of the station number of the RecHits; the CSC has 4 stations).

Based on the table (Fig. 3) and the graphs above, our initial conclusion was that the detector is sensitive to branching ratios (the fraction of Higgs bosons decaying to LLPs) of similar order of magnitude for LLPs with different masses. We concluded this because the Clustering Efficiency and the RecHits distribution are similar across the different signal samples. Additionally, we can see that the detector’s sensitivity depends on the decay particles, since pi+pi and pi0pi0 have extremely different Clustering Efficiencies despite being produced from LLPs with the same mass and lifetime.

Clustering Efficiency and Geometry

We plotted the Clustering Efficiency with respect to several factors, including the LLP decay position and the energy of the LLP. This shows us how the Clustering Efficiency varies in the detector and which showers penetrate the stations (Fig. 10-12).

The LLP decay vertex Z/R plots clearly show the shape of the detector: when the LLP decays in the sensitive detector volume, the Clustering Efficiency increases. Fig. 13 directly demonstrates why the Clustering Efficiency fluctuates – the iron/steel in the detector suppresses the Clustering Efficiency, while the sensitive detector region enhances it. There also seems to be a small increase in Clustering Efficiency with LLP energy, but the dependence is weak. As all these graphs exhibit the same general shape, we concluded that the detector is sensitive to different particles. More specifically, we can see from Fig. 10-12 that Clustering Efficiency depends on the type of shower – a particle shower is a cascade of particles produced from a particle collision. Hadronic shower particles like bb tend to have a higher Clustering Efficiency, and electromagnetic (EM) shower particles like ee tend to have a lower Clustering Efficiency. Because the Clustering Efficiency is higher for hadronic shower particles, we know that the level of sensitivity differs depending on the type of shower.

Figure 13: Diagram Showing the Muon System Geometry. This diagram paired with the Clustering Efficiency plots is very useful in showing why the Clustering Efficiency increases/decreases in the detector based on the geometry. Z (m) is on the X axis, R (m) is on the Y axis, and η (pseudorapidity) is on the Z axis.

2D Clustering Efficiency

Finally, we wanted to analyze the 2D Clustering Efficiency plots, which would help to further show the difference between hadronic and EM showers. We knew from the previous graphs that the Clustering Efficiency changed according to the type of shower due to the geometry of the detector (Fig. 13). However, we wanted to visualize the dependence on detector geometry in both the R and Z direction of the detector. Below are the 2D Clustering Efficiency graphs, with the red boxes representing the detector volume.

In a Nutshell (more like Particle Detector)

In summary, we saw that the Muon System is sensitive to all LLP decay modes – ee, bb, dd, ττ, pi0pi0, pi+pi, k+k, but that the level of sensitivity is different between these decay modes. Also, our preliminary study shows that Clustering Efficiency and Cluster Size are correlated with the LLP energy and momentum. Furthermore, we found that hadronic showers have more penetration than EM showers. Most importantly, we found that the Muon System is sensitive to LLPs with mass as low as 0.4 GeV; this is the first study at the CMS to set limits on the sensitivity of low mass LLPs.

Why Does This (anti)Matter?

Dark matter may be composed of these BSM particles, which is why this field is important to study: dark matter makes up most of the universe, and we still do not know exactly what it is. In the future, I hope to continue my research on this topic and find the exact level of sensitivity to LLP mass and decay mode. I additionally hope to research the lowest mass that the detector is sensitive to, which would help in looking for BSM particles. In general, conducting studies like this one, not bound by the SM, brings us physicists closer than ever to finding a unifying theory to describe the world around us.


Thank you so much to my SURF supervisor, Maria Spiropulu. Without her mentorship, I never would have undertaken such an amazing project. Also thank you to Cristian Pena and Si Xie who constantly reviewed my work and gave me great feedback. Special thanks to Christina Wang who guided me through every aspect of the project and helped me realize my passion for this field!


  1. Gray, Heather, and Bruno Mansoulié, “The Higgs Boson: the Hunt, the Discovery, the Study and Some Future Perspectives.” ATLAS Experiment at CERN, 4 July 2018.
  2. Particles of the Standard Model of particle physics (Image: Daniel Dominguez/ CERN).
  3. Craig, Nathaniel, et al. “Naturalness in the Dark at the LHC.” Journal of High Energy Physics, vol. 2, 23 Mar. 2015, doi:10.1007/jhep07(2015)105.
  4. Blas, J. De, et al. “Higgs Boson Studies at Future Particle Colliders.” Journal of High Energy Physics, vol. 1, 9 May 2019, doi:10.1007/jhep01(2020)139.
  5. The Matter-Antimatter Asymmetry Problem. CERN,