Example for Interacting with the Dataset


Interact with a Single File

Simulation results are stored in NetCDF format, with one file per day. Simulations for a particular panel configuration are stored in their own directory. For example, <DATA_DIR>/module_00/20190619_module_00_analog.nc is the daily simulation file for 06/19/2019 with the module tag 00. Detailed information on module tags can be found in modules.csv.
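The naming convention above can be sketched as a small path builder. This is only an illustration: `daily_file` is a hypothetical helper, and it assumes that module tags are two-digit, zero-padded integers and that every file name ends in `_analog.nc`, as in the single example given.

```python
from datetime import date
from pathlib import Path


def daily_file(data_dir, day, module_tag, suffix="analog"):
    """Return the path of the daily simulation file for one module.

    Assumes the layout <data_dir>/module_XX/<YYYYMMDD>_module_XX_<suffix>.nc,
    inferred from the single example path in the text.
    """
    module = f"module_{module_tag:02d}"
    return Path(data_dir) / module / f"{day:%Y%m%d}_{module}_{suffix}.nc"


# "data" is a placeholder for <DATA_DIR>.
path = daily_file("data", date(2019, 6, 19), 0)
print(path.as_posix())
```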

A summary of the variables included in the NetCDF file is provided below:

Please note that coordinates are not saved in the simulation files, to avoid redundancy. They can be found in a separate NetCDF file under the data root folder. These coordinates come from the NAM-NMM parent mesh grid, which has a 12 km horizontal resolution, and they are subset to the continental US domain.

There are many options to work with NetCDF files. To mention a few,

In this example, we are going to use xarray in Python because it is well supported and works with distributed computing.
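As a minimal sketch, a single daily file could be opened as follows. `open_daily` is a hypothetical helper introduced here for illustration; the only xarray call it relies on is `xr.open_dataset`, which loads the file metadata and defers reading the variable values until they are needed.

```python
from pathlib import Path


def open_daily(path):
    """Open one daily simulation file as an xarray.Dataset."""
    path = Path(path)
    if not path.is_file():
        # Fail early with a readable message instead of a backend error.
        raise FileNotFoundError(f"No simulation file at {path}")
    # Imported here so the path check above runs even where xarray is absent.
    import xarray as xr

    return xr.open_dataset(path)


# Usage, with <DATA_DIR> replaced by the data root folder:
# ds = open_daily("<DATA_DIR>/module_00/20190619_module_00_analog.nc")
# print(ds)            # dimensions, variables, and attributes
# print(ds.data_vars)  # data variables only
```

Printing the dataset is usually the quickest way to check which dimensions and variables a file contains before doing any computation.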

Prepare Environment

Preview Module Information

Visualize Daily Simulations

Analysis at Scale

The previous sections demonstrate how to read and visualize daily results with xarray. However, it is also desirable to carry out time series analysis across multiple days. The problem is that the year-round simulation result for a single module is about 88 GB, which is hard to fit into RAM directly. We need help from distributed computing.

In this section, we show how to work directly with 88 GB of data using xarray and Dask for distributed computing on a cluster. This solution is platform-independent and can be deployed on major clusters and supercomputers such as NCAR Cheyenne and Penn State Roar. Dask provides efficient and convenient parallelization for interacting with multiple NetCDF files.

Important: Here, I assume you are executing this notebook on a cluster with the PBS scheduler. You should consult your IT staff to find out which scheduler the cluster uses. The schedulers currently supported by Dask can be found here.

Request Computing Resources
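One possible way to request workers through PBS is the `PBSCluster` class from the separate `dask_jobqueue` package. The sketch below is not a site-specific configuration: `worker_options` and `start_pbs_cluster` are hypothetical helpers, and the core count, memory, queue name, and walltime are placeholder values that must be replaced with the ones your cluster requires.

```python
def worker_options(cores=36, memory_gb=100, queue="regular", walltime="01:00:00"):
    """Collect the resources requested for each PBS job.

    All default values are placeholders; check your cluster documentation.
    """
    return {
        "cores": cores,
        "memory": f"{memory_gb}GB",
        "queue": queue,
        "walltime": walltime,
    }


def start_pbs_cluster(jobs, **options):
    """Submit `jobs` PBS jobs running Dask workers and return (cluster, client)."""
    # Imported here because these packages are only needed on the cluster.
    from dask.distributed import Client
    from dask_jobqueue import PBSCluster

    cluster = PBSCluster(**options)
    cluster.scale(jobs=jobs)  # ask the scheduler for this many PBS jobs
    client = Client(cluster)  # later xarray/Dask computations use these workers
    return cluster, client


# Usage on a PBS cluster (not run here):
# cluster, client = start_pbs_cluster(jobs=4, **worker_options())
```

Workers only become available once the scheduler starts the submitted jobs, so computations may wait in the queue for some time.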

Visualize Data from Multiple NetCDF Files

At this point, you have access to the year-round ensemble simulation data. Please take a look at the data summary below, and note the number of dates in init_time. You can now easily subset the data and calculate daily, seasonal, or annual statistics.

Under the hood, the data have not actually been read into memory yet. ds is only a representation of the data structure across the multiple NetCDF files. The actual file I/O and computation happen only when you define a calculation and call compute.
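The lazy workflow described above can be sketched as follows. `list_daily_files` and `open_year` are hypothetical helpers, the file pattern assumes the `_analog.nc` naming from the single-file example, and concatenating the daily files along `init_time` is an assumption about how they fit together. The variable name `power` in the usage comments is a placeholder, not a name taken from the dataset.

```python
from pathlib import Path


def list_daily_files(data_dir, module_tag):
    """Return the daily files of one module, sorted by date."""
    module = f"module_{module_tag:02d}"
    # YYYYMMDD prefixes sort chronologically when sorted as text.
    return sorted(Path(data_dir, module).glob(f"*_{module}_analog.nc"))


def open_year(files):
    """Open many daily files as one lazy, Dask-backed xarray.Dataset."""
    import xarray as xr  # needs xarray with Dask installed

    # Assumption: the daily files are stacked along init_time in file order.
    return xr.open_mfdataset(
        files, combine="nested", concat_dim="init_time", parallel=True
    )


# Usage (not run here):
# ds = open_year(list_daily_files("<DATA_DIR>", 0))
# mean = ds["power"].mean("init_time")  # builds a task graph, reads no data
# result = mean.compute()               # file I/O and computation happen here
```

Keeping the reduction lazy until `compute` lets Dask read and process the files in chunks on the workers rather than loading all of them at once.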