Author: Rupesh Jeyaram
Mentor: Christian Frankenberg
Editor: Ishani Ganguly
Scientific research relies on the production and analysis of data. In some domains, this data may consist of a handful of measurements or hand-written observations in notebooks. In other fields, data may take the form of large Excel files and documents. In geoplanetary studies, satellite data is commonly stored in the NetCDF file format. Such file-centered data storage has been prominent for decades. However, improvements in satellite technology now produce hyper-spectral, high-resolution data that demands exponentially more disk space.
The file-based approach to data storage currently in use functions correctly, but it has major disadvantages that limit scientific inquiry. The central problem is the inefficiency of querying spatial information from files. For instance, to find all points in the dataset that fall within Los Angeles, we would traditionally use a “file-crawling” approach: examine each entry in each file, determine whether the point falls within Los Angeles, and assemble the matching entries into a subset. This is becoming computationally infeasible with the current volume of incoming data. To make matters worse, the data is difficult to access, parse, and visualize without prior experience, making this scientific information exclusive to a small community of specialized researchers.
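The file-crawling approach described above can be sketched in a few lines. This is an illustrative toy, not the lab's actual script: the "files" here are small in-memory lists standing in for daily NetCDF files, and the Los Angeles bounding box is an assumed approximation.

```python
# Illustrative sketch of file crawling: every record in every file is
# tested against the region of interest, so cost grows with total records.

# Rough bounding box around Los Angeles (lon_min, lat_min, lon_max, lat_max);
# the coordinates are an assumption for illustration.
LA_BBOX = (-118.7, 33.7, -118.1, 34.3)

def in_bbox(lon, lat, bbox):
    """True if the (lon, lat) point falls inside the bounding box."""
    lon_min, lat_min, lon_max, lat_max = bbox
    return lon_min <= lon <= lon_max and lat_min <= lat <= lat_max

# Each "file" is a list of (lon, lat, sif) records, standing in for the
# contents of one daily NetCDF file.
files = [
    [(-118.4, 34.0, 1.2), (-120.0, 36.0, 0.8)],
    [(-118.2, 33.9, 1.5), (-117.0, 32.7, 0.4)],
]

# O(total records): every record in every file must be touched.
subset = [rec for f in files for rec in f if in_bbox(rec[0], rec[1], LA_BBOX)]
print(len(subset))  # 2 records fall inside the box
```

With real data, `files` would instead be opened one NetCDF file at a time, so the full scan is repeated for every query.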
Here we address these issues by creating a spatially indexed database to efficiently store and access data from the TROPOMI dataset. Specifically, we developed a Postgres database with the PostGIS spatial extension to store global measurements of Solar Induced Fluorescence (SIF). By utilizing an R-Tree index to spatially index the TROPOMI satellite data, we can rapidly extract spatially confined datasets, enabling new forms of geoplanetary research. This type of spatial database can be generalized and used in other research scenarios with massive spatial-data volumes.
TROPOMI Data and Postgres Spatial Database
While there are many datasets from Earth-facing satellites, we used data from the TROPOMI instrument due to its relevance to the GPS carbon systems labs. TROPOMI is a spectrometer that monitors atmospheric ozone, methane, formaldehyde, aerosols, carbon monoxide, NO2, and SO2. Using the instrument’s spectral images, researchers in the Frankenberg lab retrieve SIF measurements. SIF provides key information on the total amount of light absorbed by plants during photosynthesis, and can thus be used to estimate useful quantities such as ecosystem health, cropland activity, and climate change impacts.
Postgres is a powerful, open-source object-relational database management system (OR-DBMS) that is used in a wide range of industry and data science applications. It has been maintained and updated since its first public release in 1996, making it one of the most advanced and feature-rich database options for scientific research.
While Postgres provides the structural framework for data storage, PostGIS is the Postgres extension that adds support for spatial data. PostGIS allows researchers to filter the dataset directly on a datapoint’s footprint parameters — most usefully, the footprint’s relationship to a specified geometry (e.g., inside, outside, or within x meters of a shape). In addition, spatial indexing of the TROPOMI dataset allows users to retrieve spatially grouped data points very quickly. Instead of scanning every row of every NetCDF file to find the subset of points within a small geographic region, the database zooms in on the relevant data immediately using the spatial index.
The primary benefit of using a database is the option to use a database index: a data structure that improves the speed of data retrieval operations. More specifically, it is an efficiently stored, easily traversable copy of select columns of data from a table, in which each row links directly to the complete original record. This way, the entire data table can be traversed efficiently according to a subset of columns while still preserving all attributes contained in the dataset. Generally, an index reduces query cost from O(N) to O(log N).
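The indexing and filtering described above fit together in a few SQL statements. The table and column names below (`measurements`, `footprint`, `measurement_time`, `sif`) are illustrative assumptions, not the project's actual schema; the queries would be executed against the PostGIS-enabled database, for example via psycopg2.

```python
# PostGIS spatial indexes are built on Postgres's GiST framework, which
# implements an R-tree over the footprint geometries.
create_spatial_index = """
CREATE INDEX measurements_footprint_idx
    ON measurements USING GIST (footprint);
"""

# With the index in place, a point-in-region filter no longer scans every
# row; the query planner walks the R-tree down to the relevant footprints.
# The %s placeholder takes a WKT polygon (e.g. a county boundary).
within_query = """
SELECT measurement_time, sif, footprint
  FROM measurements
 WHERE ST_Within(footprint, ST_GeomFromText(%s, 4326));
"""
```

The "within x meters of a shape" filter mentioned above would use `ST_DWithin` in the same position as `ST_Within`.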
Results and Analysis
We have loaded more than 1 billion individual data points from the TROPOMI SIF dataset into a Postgres/PostGIS spatial database. Loading this entire set (with spatial and temporal indexing enabled) took approximately 24 hours, a tractable duration for the transfer and indexing tasks given the tremendous volume of data. Once the data is ingested and the indices are built, retrieval under specified spatial and temporal constraints is very fast: extracting all the data that falls within a specific U.S. county takes less than one second across the entire dataset. As the total data volume grows, this search time increases as O(log N), where N is the total number of records.
Using a file-crawling script, extracting a single day’s records within a specified county would take approximately 2 seconds, for an overall O(d · n_ave) time complexity, where d is the number of days and n_ave is the average number of records per day. Given that there are approximately 500 days’ worth of data, extracting all the data that falls within a specific U.S. county would have taken approximately 2 seconds/day × 500 days = 1000 seconds. The database therefore provides a nearly 1000x speedup, an advantage that grows as more data is added.
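The back-of-the-envelope comparison above can be written out explicitly (the per-day crawl time and the single-query time are the approximate figures from the text):

```python
# File crawl: ~2 s per day of data, over the full archive.
days = 500
crawl_seconds = 2 * days   # O(d * n_ave) scan over every record

# Indexed database: one O(log N) query, measured at under a second;
# we round up to 1 s to keep the estimate conservative.
db_seconds = 1

speedup = crawl_seconds / db_seconds
print(speedup)  # 1000.0
```

Because the crawl cost grows linearly with the archive while the indexed query grows only logarithmically, this ratio widens as data accumulates.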
Furthermore, sub-second retrieval enables real-time interactivity. We developed a satellite data visualization tool using Python and Bokeh, a visualization library that meets our needs with easy-to-configure interactive plots (Figure 1). In conjunction with the spatial database’s rapid data extraction, the visualization tool enables researchers to examine trends in the data quickly before performing more rigorous analyses.
Using the database, researchers have been able to examine the devastating effects of the 2019 midwest floods on a delayed crop season (Figure 2). At least 1 million acres of U.S. farmland across nine major grain-producing states were flooded, forcing farmers to replant or delay their planting season. Such a delayed season would be reflected in SIF measurements: major activity at the beginning of the 2018 crop season would not appear in the same time period in 2019. That is, ΔSIF (SIF2019 – SIF2018) would be negative.
As predicted, ΔSIF is negative in the corn belt. The spatial database made this analysis extremely straightforward; a single query produced these results in a fraction of the time a file-crawling script would have taken. While the performance of the two methods should not be compared numerically due to fundamental differences in the two approaches (i.e., pre-computation vs. retrieval), the spatial database is faster for general-purpose retrieval, the metric we are most interested in. Additionally, the spatial database can perform precise datapoint-in-county tests, whereas most scripts would otherwise compute an approximation using gridded data and the fraction of each grid cell in a county. This type of analysis would not have been possible without a spatial database.
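A single query of the kind described above might look like the following sketch. The schema names and the April–June crop-season window are assumptions for illustration; the actual query used for Figure 2 is not shown in the text.

```python
# Hypothetical ΔSIF query: average SIF over the same calendar window in
# 2019 and 2018 for one region, then difference them. The %s placeholder
# takes a WKT polygon for the region of interest (e.g. a county).
delta_sif_query = """
SELECT AVG(CASE WHEN measurement_time BETWEEN '2019-04-01' AND '2019-06-30'
                THEN sif END)
     - AVG(CASE WHEN measurement_time BETWEEN '2018-04-01' AND '2018-06-30'
                THEN sif END) AS delta_sif
  FROM measurements
 WHERE ST_Within(footprint, ST_GeomFromText(%s, 4326));
"""
```

A negative `delta_sif` for a region would indicate the delayed 2019 season relative to 2018.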
As planned, we assembled a Python package to provide a fast and natural way to interact with the database. Using this package, researchers can pull data directly into their experiments, giving them the freedom to manipulate the results however they wish.
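A helper in such a package might look like the sketch below. The function name, schema, and WKT argument are hypothetical, not the package's real API; the psycopg2 execution step is commented out so the sketch stands alone.

```python
def county_query(county_wkt, start, end):
    """Build a parametrized SQL query extracting all soundings inside a
    county polygon (WKT) within a [start, end] time range. Table and
    column names are illustrative assumptions."""
    sql = """
    SELECT measurement_time, sif, footprint
      FROM measurements
     WHERE ST_Within(footprint, ST_GeomFromText(%s, 4326))
       AND measurement_time BETWEEN %s AND %s;
    """
    return sql, (county_wkt, start, end)

sql, params = county_query("POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))",
                           "2019-04-01", "2019-06-30")
# To actually run it against the database:
# import psycopg2
# with psycopg2.connect("dbname=tropomi") as conn:
#     with conn.cursor() as cur:
#         cur.execute(sql, params)
#         rows = cur.fetchall()
```

Keeping the geometry and dates as query parameters (rather than string formatting) lets the database driver handle quoting safely.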
As satellite datasets grow substantially larger, storing and accessing the data efficiently will be paramount. Not only will this tool speed up research tasks as shown, it will fundamentally unlock new types of scientific inquiry. This project implemented a spatial database that reduced query time from linear to logarithmic in the number of records, speeding up scientific queries by roughly three orders of magnitude. We loaded and indexed more than a billion data points in the database: while loading the data takes approximately 24 hours, extracting county-sized datasets takes under a second. In demonstrating the potential of spatial databases on a rich dataset such as TROPOMI, we provide evidence that such databases can greatly improve research speed and versatility. By creating a generic Python utility package, we establish the database as a convenient, researcher-friendly tool for big data analysis.
There are many future directions for integrating spatial databases with satellite data. For instance, high-resolution static topographic datasets can be merged with satellite imagery in a spatial database to inspect the effects of terrain on retrieved data values; such terrain-satellite computations could run in a fraction of the time required by traditional approaches. Researchers can also build more generalizable data-ingestion tools, enable GPU-accelerated Postgres processing to improve performance, and find more intuitive ways to visualize different datasets on the same screen. The findings of this study demonstrate a strong foundation for research scientists to explore data more efficiently using spatial databases.