Open Science Grid (OSG) School 2024 travel award and BLSP

Aug 4, 2024·
Xiaojie Gao (高孝杰)

Background and the computing challenge

Vegetation phenology, the timing of life-cycle events in vegetation such as leaf development and senescence, plays an important role in understanding ecosystem dynamics and climate interactions. It provides valuable insights into how plants respond to environmental factors such as temperature, precipitation, and day length, making it a key indicator of climate change impacts. Phenology also serves as the primary control of the duration of the growing season and photosynthesis, which directly affect ecosystem carbon, water, and energy cycles.

Collecting vegetation phenology data on the ground by routinely surveying individual trees is spatially limited and labor-intensive. In contrast, satellite remote sensing-based land surface phenology (LSP) consistently tracks terrestrial vegetation over large areas, substantially increasing our ability to monitor phenology. However, until recently, the LSP products scientists used to understand continent-wide and global changes in plant life cycles either had coarse spatial resolution (e.g., a pixel corresponding to a ground area of 500 m x 500 m) or were available only for a few recent years, meaning that important finer-scale spatial patterns and longer-term changes could be missed. To address this issue, I developed a novel Bayesian land surface phenology (BLSP) model to produce the first long-term (1984-present), gap-free, remote sensing-based phenology data with pixel-wise uncertainty estimation at 30 m spatial resolution. Compared to existing LSP data, BLSP not only covers longer periods but also better matches the spatial scales of many ground-collected ecological data.

However, producing BLSP maps requires substantial computing resources because satellite images are large in data volume, and the BLSP model relies on the computationally intensive Markov chain Monte Carlo (MCMC) technique. Even with parallel processing on the high-performance computing (HPC) facility provided by our university, mapping BLSP for a region such as the city of Madison would take several weeks, not including the time required for data downloading and preprocessing. Computing power is the bottleneck in applying BLSP to large geographic regions.

The solution with HTC and OSPool

At the OSG School 2024, I learned that high-throughput computing (HTC) technology, along with the Open Science Pool (OSPool) computing resource, might solve my BLSP computing problem. HTC excels at running millions of small, independent computing tasks in parallel, and OSPool provides distributed computing capacity from dozens of institutions rather than relying on a single HPC environment. Because each pixel time series can be processed independently of the others, running BLSP on millions of pixel time series may be an ideal use case for HTC and OSPool.

I have designed a BLSP processing workflow using HTC and OSPool:

[Figure: the BLSP processing workflow]

  1. Satellite image data downloading: To utilize parallel processing for data downloading, the workflow splits the study period into individual years and submits a data downloading job for each year. After downloading the data, the computing node performs preprocessing on the satellite images and transfers the output images to the Open Science Data Federation (OSDF) directories.
  2. Data chunking: In this step, the workflow merges all years of data into a 3D image array, where the first two dimensions are the spatial (row and column) dimensions of the satellite images, and the third dimension represents the image collection dates in sorted order. Then, the workflow splits the 3D image array into smaller chunks along the first two dimensions, meaning each chunk covers a smaller spatial area while keeping the entire time period. The chunks are indexed by their location to facilitate merging later.
  3. BLSP processing for each pixel time series: This step submits a job array to the OSPool, where each job processes an image chunk. The jobs perform BLSP processing for each pixel time series and return a BLSP image chunk, which is also a 3D image array containing multiple annual phenological metrics. After processing, the computing nodes also transfer the results back to the OSDF directories.
  4. Combining BLSP chunks into maps: After all chunks are processed, a single-core job is submitted to combine all BLSP chunks into maps, with each annual phenological metric represented as a single layer. The results are also transferred to the OSDF directory.
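Steps 2 to 4 can be illustrated with a minimal NumPy sketch. It is not the actual BLSP code: the array sizes are made up, and `process_chunk` is a hypothetical stand-in that computes two placeholder values per pixel instead of fitting the Bayesian model. The point is only to show splitting along the two spatial dimensions, keeping the full time axis in every chunk, and using the chunk location as the index for merging.

```python
import numpy as np


def split_into_chunks(stack, chunk_rows, chunk_cols):
    """Split a (rows, cols, dates) array along the two spatial axes only.

    Returns a dict keyed by the (row, col) offset of each chunk, so the
    location index can be used to put the results back in place later.
    """
    n_rows, n_cols, _ = stack.shape
    chunks = {}
    for r0 in range(0, n_rows, chunk_rows):
        for c0 in range(0, n_cols, chunk_cols):
            # Slices that run past the edge simply yield smaller chunks.
            chunks[(r0, c0)] = stack[r0:r0 + chunk_rows, c0:c0 + chunk_cols, :]
    return chunks


def process_chunk(chunk):
    """Hypothetical stand-in for the per-pixel BLSP fit (NOT the real model).

    Each pixel time series is reduced to two placeholder "metrics" (mean and
    max), so the output keeps the spatial shape of the input chunk.
    """
    return np.stack([chunk.mean(axis=2), chunk.max(axis=2)], axis=2)


def merge_chunks(results, n_rows, n_cols):
    """Reassemble processed chunks into one (rows, cols, metrics) array."""
    n_metrics = next(iter(results.values())).shape[2]
    merged = np.full((n_rows, n_cols, n_metrics), np.nan)
    for (r0, c0), block in results.items():
        h, w, _ = block.shape
        merged[r0:r0 + h, c0:c0 + w, :] = block
    return merged


# Toy stack: 5 x 7 pixels and 4 image dates (values are made up).
stack = np.arange(5 * 7 * 4, dtype=float).reshape(5, 7, 4)
chunks = split_into_chunks(stack, chunk_rows=2, chunk_cols=3)

# In the workflow, each chunk would be handled by a separate OSPool job;
# here the chunks are processed in a plain loop for illustration.
results = {loc: process_chunk(chunk) for loc, chunk in chunks.items()}
merged = merge_chunks(results, n_rows=5, n_cols=7)
```

Because no pixel depends on its neighbors in this sketch, processing the chunks separately and merging gives the same result as processing the whole stack at once, which is what makes the per-chunk jobs independent.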

Currently, I have successfully implemented the data downloading step using OSPool and am working on automating the rest of the workflow. Since the computing nodes are distributed across multiple machines with independent internet connections, distributed parallel downloading of data using OSPool has reduced the download time from days to hours. Although the entire workflow is still under development, I have tested the feasibility of individual steps and successfully processed a BLSP map for the city of Madison in 2021 as an example (Fig. 2). My goal is to fully automate the workflow using the Directed Acyclic Graph Manager (DAGMan), enabling the production of BLSP maps for a study region with just one manual job submission.
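As a rough sketch of what that automation could look like, the Python snippet below writes a DAGMan input file that chains the four steps. It uses only the standard DAGMan keywords `JOB`, `VARS`, and `PARENT ... CHILD`. The submit file names (`download.sub`, `chunk.sub`, `blsp.sub`, `merge.sub`), the node names, and the `year` and `chunk_id` variables are hypothetical placeholders, not files from the actual workflow. The sketch also assumes the number of chunks is known when the DAG is written, and it uses one DAG node per chunk rather than a job array.

```python
def build_dag(years, n_chunks):
    """Return DAGMan input text for the four workflow stages.

    All submit-file names and VARS keys here are hypothetical placeholders.
    """
    lines = []

    # Stage 1: one download/preprocess node per year.
    download_jobs = []
    for year in years:
        name = f"download_{year}"
        download_jobs.append(name)
        lines.append(f"JOB {name} download.sub")
        lines.append(f'VARS {name} year="{year}"')

    # Stage 2: a single node that stacks all years and writes the chunks.
    lines.append("JOB chunk chunk.sub")

    # Stage 3: one BLSP node per image chunk.
    blsp_jobs = []
    for i in range(n_chunks):
        name = f"blsp_{i}"
        blsp_jobs.append(name)
        lines.append(f"JOB {name} blsp.sub")
        lines.append(f'VARS {name} chunk_id="{i}"')

    # Stage 4: a single node that merges the BLSP chunks into maps.
    lines.append("JOB merge merge.sub")

    # Dependencies: downloads -> chunking -> BLSP chunks -> merge.
    lines.append(f"PARENT {' '.join(download_jobs)} CHILD chunk")
    lines.append(f"PARENT chunk CHILD {' '.join(blsp_jobs)}")
    lines.append(f"PARENT {' '.join(blsp_jobs)} CHILD merge")
    return "\n".join(lines) + "\n"


# Small illustrative example: three years and two chunks.
dag_text = build_dag(years=range(1984, 1987), n_chunks=2)
print(dag_text)
```

Saving the generated text to a file and submitting it once with `condor_submit_dag` would let DAGMan start each stage only after the previous one has finished.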

[Fig. 2: BLSP map for the city of Madison, 2021 (madison_blsp_sos_2021)]