Using the IMI Clustering Options

Why use the clustering options?

The main computational cost of the IMI is running the perturbation simulations needed to construct the Jacobian. This requires running a GEOS-Chem Jacobian simulation for each state vector element. The default state vector generated by the IMI is at native resolution, meaning each element corresponds to a single GEOS-Chem grid cell (0.25 degree or 0.5 degree resolution). If your state vector has a large number of elements, running the IMI can become infeasible because of prohibitively high AWS costs or compute time. Clustering reduces the number of state vector elements by aggregating elements together.

Using the IMI clustering config options

To enable the clustering component of the IMI, set ReducedDimensionStateVector: true in the IMI config file. Once enabled, the IMI uses the specified ClusteringMethod to aggregate native resolution state vector elements within your domain of interest down to the specified NumberOfElements. For example:

ReducedDimensionStateVector: true
ClusteringMethod: "kmeans"
NumberOfElements: 39

This automatically generates a state vector with 39 elements (including buffer elements) in the domain of interest. This is done by creating a set of information-content-informed clustering pairs (e.g. [[1, 15], [2, 24]]). Note: As you reduce the dimension of your state vector, you should also decrease the value of your regularization factor Gamma correspondingly. It can be scaled by the ratio of the reduced number of elements to the original number of elements (e.g. len(new_elements)/len(orig_elements)).
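The Gamma scaling in the note above can be sketched as a one-line calculation. The function name scale_gamma and the numbers used here are illustrative assumptions, not part of the IMI code base.

```python
def scale_gamma(gamma, n_reduced, n_native):
    """Scale Gamma by the ratio of reduced to native state vector element counts."""
    if n_reduced <= 0 or n_native <= 0:
        raise ValueError("element counts must be positive")
    return gamma * n_reduced / n_native


# Hypothetical case: Gamma = 1.0 was chosen for a 75-element native state
# vector that clustering has reduced to 40 elements.
scaled_gamma = scale_gamma(1.0, 40, 75)
print(round(scaled_gamma, 3))
```

The scaled value is only a starting point; you may still want to check it against the inversion results before settling on it.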

Each clustering pair consists of an aggregation level and the number of state vector elements allocated at that aggregation level. In the above example, the user requests 39 total state vector elements, and the algorithm determines the information-content-informed pattern to be 15 native resolution elements and 24 elements that each aggregate two native resolution grid cells. Any native resolution elements that have not been allocated are then aggregated into a single element. Using the above clustering pairs, if the domain of interest has 63 elements in the original state vector, 15 of the elements keep the original resolution and the other 48 are aggregated into 24 two-grid-cell elements. If the original state vector instead has 75 elements in the domain of interest, the remaining 12 unallocated elements are aggregated into a single element, yielding a new state vector with 40 elements in the domain of interest.
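The arithmetic in the paragraph above can be checked with a short helper. This is a minimal sketch; the function name reduced_element_count is hypothetical and it only reproduces the counting rule described here, not the IMI clustering itself.

```python
def reduced_element_count(cluster_pairs, n_native):
    """Return (n_reduced_elements, n_leftover_cells).

    cluster_pairs is a list of [aggregation_level, n_elements] pairs and
    n_native is the number of native resolution elements in the domain.
    """
    allocated_cells = sum(level * count for level, count in cluster_pairs)
    if allocated_cells > n_native:
        raise ValueError("cluster pairs allocate more grid cells than the domain contains")
    n_elements = sum(count for _, count in cluster_pairs)
    leftover = n_native - allocated_cells
    if leftover > 0:
        n_elements += 1  # all unallocated cells collapse into one extra element
    return n_elements, leftover


pairs = [[1, 15], [2, 24]]
print(reduced_element_count(pairs, 63))  # (39, 0)
print(reduced_element_count(pairs, 75))  # (40, 12)
```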

The cluster pairings are generated by aggregating elements until each cluster reaches a threshold in its estimated DOFS, which is a measure of information content. We find that a threshold of total_DOFs / num_state_vector_elements gives a reasonable result.
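To illustrate the thresholding idea only, the toy sketch below groups elements until each group reaches the DOFS threshold. It is not the IMI implementation: it ignores spatial adjacency, the function name group_by_dofs and the per-element DOFS values are made up, and it assumes num_state_vector_elements means the requested number of elements.

```python
def group_by_dofs(dofs, n_target):
    """Greedily group element indices until each group reaches the DOFS threshold."""
    threshold = sum(dofs) / n_target  # total_DOFs / num_state_vector_elements
    # Visit elements from highest to lowest estimated DOFS (ties keep input order).
    order = sorted(range(len(dofs)), key=lambda i: dofs[i], reverse=True)
    clusters, current, current_dofs = [], [], 0.0
    for idx in order:
        current.append(idx)
        current_dofs += dofs[idx]
        if current_dofs >= threshold:
            clusters.append(current)
            current, current_dofs = [], 0.0
    if current:
        clusters.append(current)  # leftover low-information elements form one group
    return clusters


# Hypothetical per-element DOFS estimates for five native resolution elements.
print(group_by_dofs([1.0, 0.5, 0.25, 0.125, 0.125], 4))  # [[0], [1], [2, 3, 4]]
```

In this toy case the two high-information elements stay on their own while the three low-information elements are merged, which is the qualitative behavior the threshold is meant to produce. The greedy pass does not guarantee exactly n_target groups.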

The ClusteringMethod specifies which clustering method to use for state vector reduction. Currently, kmeans and mini-batch-kmeans are the valid options. mini-batch-kmeans is very similar to kmeans but can be less accurate. It is best used to speed up state vector reduction for very large state vectors.

Note: The IMI preserves the original state vector file as NativeStateVector.nc in your run directory.

Incorporating point source information

If you have prior information about specific locations that you would like to keep at high resolution (e.g. point source detections), you can ensure the clustering algorithm preserves these locations by using the ForcedNativeResolutionElements config variable. This variable takes a list of lat/lon locations, given either as a YAML list or as a path to a CSV file.

For instance, if you suspect a location to be an emission hotspot, you can specify its lat/lon coordinates as in the examples below, and the clustering algorithm will ensure that the corresponding native resolution element is preserved during aggregation. For the IMI to preserve the element, NumberOfElements must be large enough to accommodate the number of grid cells you would like to force to native resolution.

Additionally, the PointSourceDatasets config variable can be used to automatically scrape emission hotspots from external point source datasets. Currently, the only supported dataset is the "SRON" weekly plumes dataset.

yaml list example:

PointSourceDatasets: ["SRON"]
ForcedNativeResolutionElements:
  - [31.5, -104]
  - [32.5, -103.5]

csv file example:

PointSourceDatasets: ["SRON"]
ForcedNativeResolutionElements: "/path/to/point_source_locations.csv"

The csv file should have a header row with the column names lat and lon using lowercase letters. The csv file can have additional columns, but they will be ignored.
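Before pointing ForcedNativeResolutionElements at a CSV file, it can help to confirm that the file has the expected layout. The sketch below is one way to do that with the Python standard library; the function name read_point_sources, the source_name column, and the sample rows are hypothetical.

```python
import csv
import io


def read_point_sources(csv_file):
    """Return [lat, lon] pairs from a CSV with lowercase 'lat' and 'lon' columns."""
    reader = csv.DictReader(csv_file)
    missing = {"lat", "lon"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    # Any additional columns are simply ignored.
    return [[float(row["lat"]), float(row["lon"])] for row in reader]


# In-memory stand-in for a point source file with one extra column.
example = io.StringIO("lat,lon,source_name\n31.5,-104,site_a\n32.5,-103.5,site_b\n")
print(read_point_sources(example))  # [[31.5, -104.0], [32.5, -103.5]]
```

For a file on disk, the same function can be called on an open file handle, for example with open(path, newline="") as f.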

Dynamic Kalman Filter clustering

When running the IMI in Kalman Filter mode, users can dynamically adjust clusters at each Kalman iteration to best reflect the available information content by setting the DynamicKFClustering variable to true. See the Kalman Filter IMI documentation for more details.

IMI clustering scheme

The IMI clustering algorithm uses a k-means-based method similar to the one described in Nesser et al. (2021). It maintains native resolution in areas with high information content (high prior emissions, high observation density) while aggregating cells with low information content.

Reducing computational cost while maintaining inversion quality

While clustering is an effective way to alleviate the computational constraints of running high-resolution inversions over large regions, it can introduce aggregation error and degrade the quality of your inversion (Turner and Jacob, 2014). It is therefore important to weigh the computational benefits of reducing your state vector against the loss in inversion quality. This can be done by iteratively tuning the cluster pairings and running the IMI preview to assess the estimated DOFS. Ideally, you should find a middle ground where the estimated DOFS and the computational cost are both at an acceptable level before proceeding with the inversion.