Reading ICESat-2 Data in for Analysis

This notebook illustrates the use of icepyx for reading ICESat-2 data files and loading them into a data object. Currently the default data object is an Xarray Dataset, with ongoing work to provide support for other data object types.

For more information on how to order and download ICESat-2 data, see the icepyx data access tutorial.

Motivation

Most often, when you open a data file, you must specify the underlying data structure and how you’d like the information to be read in. For example, when opening a csv or similarly delimited file, you tell the software whether the data contains a header row, what the data type is (string, double, float, boolean, etc.) for each column, what the delimiter is, and which columns or rows you’d like to be loaded. Many ICESat-2 data readers are quite manual in nature, requiring that you accurately type out a list of string paths to the various data variables.

icepyx simplifies this process by relying on its awareness of how ICESat-2 data files store their variables. Instead of needing to manually iterate through the beam pairs, you can provide a few options to the Read object and icepyx will do the heavy lifting for you (as detailed in this notebook).

Approach

If you’re interested in what’s happening under the hood: icepyx turns your instructions into something called a catalog, then uses the Intake library and the catalog to actually load the data into memory. Specifically, icepyx creates an Intake data catalog for each requested variable and then merges the read-in data from each of the variables to create a single data object.

Intake catalogs are powerful (and the tool we selected) because they can be saved, shared, modified, and reused to reproducibly read in a set of data files in a consistent way as part of an analysis workflow. This approach streamlines the transition between data sources (local/downloaded files or, ultimately, cloud/bucket access) and data object types (e.g. Xarray Dataset or GeoPandas GeoDataFrame).

Import packages, including icepyx

import icepyx as ipx

Quick-Start Guide

For those who want to start playing with this right away (without all the details/explanations):

path_root = '/full/path/to/your/data/'
pattern = "processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"
reader = ipx.Read(path_root, "ATL06", pattern) # or ipx.Read(filepath, "ATLXX") if your filenames match the default pattern
reader.vars.append(beam_list=['gt1l', 'gt3r'], var_list=['h_li', "latitude", "longitude"])
ds = reader.load()
ds
ds.plot.scatter(x="longitude", y="latitude", hue="h_li", vmin=-100, vmax=2000)

Key steps for loading (reading) ICESat-2 data

Reading in ICESat-2 data with icepyx happens in a few simple steps:

  1. Let icepyx know where to find your data (this might be local files or urls to data in cloud storage)

  2. Tell icepyx how to interpret the filename format

  3. Create an icepyx Read object

  4. Make a list of the variables you want to read in (does not apply for gridded products)

  5. Load your data into memory (or read it in lazily, if you’re using Dask)

We go through each of these steps in more detail in this notebook.

Step 0: Get some data if you haven’t already

Here are a few lines of code to get you set up with a few data files if you don’t already have some on your local system.

region_a = ipx.Query('ATL06', [-55, 68, -48, 71], ['2019-02-22', '2019-02-28'],
                     start_time='00:00:00', end_time='23:59:59')
region_a.earthdata_login(uid='icepyx_devteam', email='icepyx.dev@gmail.com')
region_a.download_granules(path=path_root)

Step 1: Set data source path

Provide a full path to the data to be read in (i.e. opened). Currently accepted inputs are:

  • a directory

  • a single file

All files to be read in must have a consistent filename pattern. If a directory is supplied as the data source, all files in any subdirectories that match the filename pattern will be included.

S3 bucket data access is currently under development and requires that you be registered with NSIDC as a beta tester for cloud-based ICESat-2 data. icepyx is working to ensure a smooth transition to working with remote files. We’d love your help exploring and testing these features as they become available!

path_root = '/full/path/to/your/data/'
# filepath = path_root + 'ATL06-20181214041627-Sample.h5'
# urlpath = 's3://nsidc-cumulus-prod-protected/ATLAS/ATL03/004/2019/11/30/ATL03_20191130221008_09930503_004_01.h5'
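
To make the directory behaviour concrete, here is a minimal sketch of a recursive search for .h5 files using only the Python standard library. It illustrates the idea of picking up files from subdirectories; it is not icepyx's internal implementation, and the temporary directory layout and file names are made up for the example.

```python
import tempfile
from pathlib import Path

# Build a throwaway directory tree (hypothetical file names) to search.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "2019" / "02").mkdir(parents=True)
    (root / "ATL06_a.h5").touch()
    (root / "2019" / "02" / "ATL06_b.h5").touch()
    (root / "notes.txt").touch()  # not a data file, so it should be skipped

    # rglob searches the directory and every subdirectory below it.
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.h5"))

print(found)
```

Both .h5 files are found, including the one nested two levels down, while the text file is ignored.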

Step 2: Create a filename pattern for your data files

Files provided by NSIDC typically match the format "ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5" where the parameters in curly brackets indicate a parameter name (left of the colon) and character length or format (right of the colon). Some of this information is used during data opening to help correctly read and label the data within the data structure, particularly when multiple files are opened simultaneously.

By default, icepyx will assume your filenames follow the default format. However, you can easily read in other ICESat-2 data files by supplying your own filename pattern. For instance, pattern="ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5". A few example patterns are provided below.

# pattern = 'ATL06-{datetime:%Y%m%d%H%M%S}-Sample.h5'
# pattern = 'ATL{product:2}-{datetime:%Y%m%d%H%M%S}-Sample.h5'
# pattern = "ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"
# grid_pattern = "ATL{product:2}_GL_0311_{res:3}m_{version:3}_{revision:2}.nc"
pattern = "processed_ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}{orbitsegment:2}_{version:3}_{revision:2}.h5"
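
If you are wondering how a pattern like this maps onto a real filename, the following sketch parses a filename with the standard library. The parse_filename helper is hypothetical and written only for this illustration (icepyx hands the pattern to Intake, whose own parsing may differ in detail); the filename is the one from the S3 example in Step 1.

```python
import re
from datetime import datetime

# Matches one "{name:spec}" field in a filename pattern.
FIELD = re.compile(r"\{(\w+):([^}]+)\}")


def parse_filename(pattern, filename):
    """Extract the named fields of `pattern` from `filename` (illustrative only)."""
    regex, time_formats, pos = "", {}, 0
    for field in FIELD.finditer(pattern):
        regex += re.escape(pattern[pos:field.start()])  # literal text between fields
        name, spec = field.groups()
        if spec.isdigit():
            regex += f"(?P<{name}>.{{{spec}}})"  # fixed character length
        else:
            regex += f"(?P<{name}>\\d+)"  # strftime-style spec, parsed below
            time_formats[name] = spec
        pos = field.end()
    regex += re.escape(pattern[pos:])

    match = re.fullmatch(regex, filename)
    if match is None:
        raise ValueError(f"{filename!r} does not match {pattern!r}")
    fields = match.groupdict()
    for name, fmt in time_formats.items():
        fields[name] = datetime.strptime(fields[name], fmt)
    return fields


default_pattern = ("ATL{product:2}_{datetime:%Y%m%d%H%M%S}_{rgt:4}{cycle:2}"
                   "{orbitsegment:2}_{version:3}_{revision:2}.h5")
info = parse_filename(default_pattern, "ATL03_20191130221008_09930503_004_01.h5")
print(info["product"], info["rgt"], info["cycle"], info["datetime"])
```

Note how the eight digits after the timestamp are split by length alone into rgt, cycle, and orbitsegment, which is why the character lengths in the pattern matter.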

Step 3: Create an icepyx read object

The Read object has two required inputs:

  • path = a string with the full file path or full directory path to your hdf5 (.h5) format files.

  • product = the data product you’re working with, also known as the “short name”.

The Read object also accepts two optional keyword inputs:

  • pattern = a formatted string indicating the filename pattern required for Intake’s path_as_pattern argument.

  • catalog = a string with the full path to an Intake catalog, for users who wish to use their own catalog (note this may have unintended consequences if multiple granules are being combined).

reader = ipx.Read(data_source=path_root, product="ATL06", filename_pattern=pattern) # or ipx.Read(filepath, "ATLXX") if your filenames match the default pattern
reader._filelist

Step 4: Specify variables to be read in

To load your data into memory or prepare it for analysis, icepyx needs to know which variables you’d like to read in. If you’ve used icepyx to download data from NSIDC with variable subsetting (which is the default), then you may already be familiar with the icepyx Variables module and how to create and modify lists of variables. We showcase a specific case here, but we encourage you to check out the icepyx Variables example for a thorough trip through how to create and manipulate lists of ICESat-2 variable paths (examples are provided for multiple data products).

To see a (likely very long) list of all path + variable combinations available to you, use the call below. This immutable (unchangeable) list is generated by default from the first file in your list (so not all variables may be contained in all of the files, depending on how you are accessing the data).

reader.vars.avail()

To make things easier, you can use icepyx’s built-in default list that loads commonly used variables for your non-gridded data product, or create your own list of variables to be read in. icepyx will determine what variables are available for you to read in by creating a list from one of your source files. If you have multiple files that you’re reading in, icepyx will automatically generate a list of filenames and take the first one to get the list of available variables.

Thus, if you have different variables available across files (even from the same data product), you may run into issues and need to come up with a workaround (we can help you do so!). We anticipate most users will have the minimum set of variables they are seeking to load available across all data files, so we’re not currently developing this feature. Please get in touch if it would be a helpful feature for you or if you encounter this problem!

You may create a variable list for gridded ICESat-2 products. However, all variables in the file will still be added to your DataSet. (This is an area we’re currently looking to expand. Please let us know if you’re working on this and would like to contribute!)

For a basic case, let’s say we want to read in height, latitude, and longitude for all beam pairs. We create our variables list as

reader.vars.append(var_list=['h_li', "latitude", "longitude"])

Then we can view a dictionary of the variables we’d like to read in.

reader.vars.wanted
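
For a sense of what such a dictionary holds, here is a hand-built sketch that maps each variable name to one path per beam. It assumes the ATL06 layout in which these variables sit under a land_ice_segments group for each ground track; the exact contents and ordering of reader.vars.wanted may differ.

```python
# The six ICESat-2 ground tracks (three beam pairs, left and right).
beams = ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
variables = ["h_li", "latitude", "longitude"]

# Assumed layout: <beam>/land_ice_segments/<variable>
wanted = {
    var: [f"{beam}/land_ice_segments/{var}" for beam in beams]
    for var in variables
}

for var, paths in wanted.items():
    print(var, paths[:2], "...")
```

This is the list of string paths you would otherwise type out by hand: three variables across six beams already gives eighteen paths.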

Don’t forget - if you need to start over, and re-generate your wanted variables list, it’s easy!

reader.vars.remove(all=True)

Step 5: Loading your data

Now that you’ve set up all the options, you’re ready to read your ICESat-2 data into memory!

ds = reader.load()

Within a Jupyter Notebook, you can get a summary view of your data object.

ATTENTION: icepyx loads your data by creating an Xarray DataSet for each input granule and then merging them. In some cases, the automatic merge fails and needs to be handled manually. In these cases, icepyx will return a warning with the error message from the failed Xarray merge and a list of per-granule DataSets.

This can happen if you unintentionally provide the same granule multiple times with different filenames or in segmented products where the rgt+cycle automatically generated gran_idx values match. In this latter case, you can simply provide unique gran_idx values for each DataSet in ds and run import xarray as xr and ds_merged = xr.merge(ds) to create one merged DataSet.

ds
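
The manual merge described in the note above might look like the following sketch. The two small Datasets are made up to stand in for the per-granule list that icepyx returns, and the sketch assumes gran_idx is a dimension coordinate on each of them; the dimension name point and the data values are placeholders.

```python
import xarray as xr

# Two made-up per-granule Datasets that ended up with the same gran_idx value.
ds_a = xr.Dataset(
    {"h_li": (("gran_idx", "point"), [[1.0, 2.0]])},
    coords={"gran_idx": [84905], "point": [0, 1]},
)
ds_b = xr.Dataset(
    {"h_li": (("gran_idx", "point"), [[3.0, 4.0]])},
    coords={"gran_idx": [84905], "point": [0, 1]},
)
ds_list = [ds_a, ds_b]

# Give each Dataset its own gran_idx value, then merge them into one.
unique = [d.assign_coords(gran_idx=[i]) for i, d in enumerate(ds_list, start=1)]
ds_merged = xr.merge(unique, compat="no_conflicts", join="outer")

print(ds_merged["h_li"].values)
```

The compat and join arguments are spelled out so the behaviour does not depend on the defaults of the installed Xarray version. In practice you would choose gran_idx values that are meaningful for your granules rather than a simple counter.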

On to data analysis!

From here, you can begin your analysis. Ultimately, icepyx aims to include an Xarray extension with ICESat-2 aware functions that allow you to do things like easily use only data from strong beams. That functionality is still in development. For fun, we’ve included a basic plot made with Xarray’s built-in functionality.

ds.plot.scatter(x="longitude", y="latitude", hue="h_li", vmin=-100, vmax=2000)

A developer note to users: our next steps will be to create an xarray extension with ICESat-2 aware functions (like “get_strong_beams”, etc.). Please let us know if you have any ideas or already have functions developed (we can work with you to add them, or add them for you!).
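
As a rough idea of what a strong-beam helper could do, here is a small sketch. It is not part of icepyx: the function name is hypothetical, and it assumes the spacecraft orientation flag convention (0 = backward with the left beams strong, 1 = forward with the right beams strong, anything else treated as a transition), which you should check against the product documentation before relying on it.

```python
# Ground tracks come in pairs; "l" and "r" mark the left and right beam.
BEAM_PAIRS = [("gt1l", "gt1r"), ("gt2l", "gt2r"), ("gt3l", "gt3r")]


def strong_beams_for_orientation(sc_orient):
    """Return the strong-beam ground tracks for an orientation flag.

    Hypothetical helper, assuming 0 = backward (left beams strong)
    and 1 = forward (right beams strong).
    """
    if sc_orient == 0:
        return [left for left, _ in BEAM_PAIRS]
    if sc_orient == 1:
        return [right for _, right in BEAM_PAIRS]
    raise ValueError("orientation is in transition; strong beams are undefined")


print(strong_beams_for_orientation(1))
```

The resulting list could then be passed as beam_list when building your wanted variables, as in the Quick-Start Guide.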

More on Intake catalogs and the read object

As anyone familiar with ICESat-2 hdf5 files knows, one of the challenges to reading in data is looping through all of the beam pairs for each track. The icepyx read module takes advantage of icepyx’s variables module, which has some awareness of ICESat-2 data and uses that to save the user the trouble of having to loop through each beam pair. The reader.load() function does this by automatically creating minimal Intake catalogs for each variable path, reading in the data, and merging each variable into a ready-to-analyze Xarray DataSet. The Intake savvy user may wish to view the template catalog or use an existing catalog.

Viewing the template catalog

You can access the ICESat-2 catalog template as an attribute of the read object.

NOTE: accessing reader.is2catalog creates a template with a placeholder in the ‘group’ parameter; thus, it cannot be used to actually read in data.

reader.is2catalog
reader.is2catalog.gui

Use an existing catalog

If you already have a catalog for your data, you can supply that when you create the read object.

catpath = path_root + 'test_catalog.yml'
reader = ipx.Read(data_source=path_root, product="ATL06", filename_pattern=pattern, catalog=catpath)

Then, you can use the catalog you supplied by calling intake’s read directly to read in the specified data variable.

ds = reader.is2catalog.read()

NOTE: this means that you will only be able to read in a single data variable!

To take advantage of icepyx’s knowledge of ICESat-2 data nesting of beam pairs and read in multiple related variables at once, you must use the variable approach outlined earlier in this tutorial.

ds = reader.load()
ds

More customization options

If you’d like to use the icepyx ICESat-2 Catalog template to create your own customized catalog, we recommend that you access the build_catalog function directly, which returns an Intake Catalog instance.

You’ll need to supply the required data_source, path_pattern, and source_type arguments. data_source and path_pattern are described in Steps 1 and 2 of this tutorial. source_type is the string you’d like to use for your Local Catalog entry.

This function accepts dictionaries with appropriate keys (depending on the Intake driver you are using) as keyword input arguments (kwargs). The simplest version of this is specifying the variable parameters and paths of interest. grp_paths may contain “variables”, each of which must then be further defined by grp_path_params. You cannot use glob-like path syntax to access variables (so grp_paths = '/*/land_ice_segments' is NOT VALID).

import icepyx.core.is2cat as is2cat

# build a custom ICESat-2 catalog with a group and parameter
cat = is2cat.build_catalog(data_source=path_root,
                           path_pattern=pattern,
                           source_type="manual_catalog",
                           grp_paths="/{{gt}}/land_ice_segments",
                           grp_path_params=[{"name": "gt",
                                             "description": "Ground track",
                                             "type": "str",
                                             "default": "gt1l",
                                             "allowed": ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"]
                                             }]
                           )
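
To see why each “variable” in grp_paths needs a matching entry in grp_path_params, the sketch below mimics the substitution in plain Python. Intake performs the templating itself; the expand_path helper here is hypothetical and only illustrates how a default and a list of allowed values turn the template into a concrete group path.

```python
import re

grp_path = "/{{gt}}/land_ice_segments"
grp_path_params = [{
    "name": "gt",
    "default": "gt1l",
    "allowed": ["gt1l", "gt1r", "gt2l", "gt2r", "gt3l", "gt3r"],
}]


def expand_path(template, params, **chosen):
    """Substitute each {{name}} placeholder, checking it against 'allowed'."""
    values = {}
    for param in params:
        value = chosen.get(param["name"], param["default"])
        if value not in param["allowed"]:
            raise ValueError(f"{value!r} is not allowed for {param['name']!r}")
        values[param["name"]] = value
    # Replace every {{name}} with the value selected for that parameter.
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: values[m.group(1)], template)


print(expand_path(grp_path, grp_path_params))             # uses the default
print(expand_path(grp_path, grp_path_params, gt="gt3r"))  # explicit ground track
```

A wildcard such as gt="*" is rejected in this sketch because it is not in the allowed list, which mirrors the restriction on glob-like paths noted above.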

Saving your catalog

If you create a highly customized ICESat-2 catalog, you can use Intake’s save to export it as a .yml file.

Don’t forget that you can easily use an existing catalog (such as the highly customized one you just made) to read in your data with reader = ipx.Read(path_root, "ATL06", pattern, catalog=catpath). It’s as easy as re-creating your reader object with your modified catalog.

catpath = path_root + 'test_catalog.yml'
cat.save(catpath)

Credits

  • original notebook by: Jessica Scheick

  • notebook contributors: Wei Ji and Tian

  • templates for default ICESat-2 Intake catalogs from: Wei Ji and Tian.