Using the IMI Clustering Options
================================

Why use the clustering options?
-------------------------------
The main computational cost of the IMI is running the perturbation simulations needed to 
construct the Jacobian. This requires running a GEOS-Chem Jacobian simulation for each 
state vector element. The default state vector generated by the IMI has elements at native 
resolution, meaning each element corresponds to a single GEOS-Chem grid 
cell (at up to 0.125°×0.15625° resolution). If your state vector has a large number of 
elements, this can make running the IMI infeasible, either due to prohibitively high AWS 
costs or compute time. Clustering reduces the number of state vector elements by 
aggregating elements together. 

Using the IMI clustering config options
---------------------------------------
To enable the clustering component of the IMI, set ``ReducedDimensionStateVector: true`` 
in the IMI config file. Once enabled, the IMI uses your specified ``NumberOfElements`` to 
aggregate native resolution state vector elements within your domain of interest using the 
specified ``ClusteringMethod``. Two optional clustering options, ``MaxClusterSize`` and 
``ClusteringThreshold``, control the maximum number of elements per cluster and the size 
distribution of clusters, respectively. For example:

::

    ReducedDimensionStateVector: true
    ClusteringMethod: "kmeans"
    NumberOfElements: 39
    MaxClusterSize: 6 # Optional
    ClusteringThreshold: 0.4 # Optional

This automatically generates a state vector with 39 elements (including buffer elements) in the 
domain of interest. The algorithm attempts to ensure that every cluster has an information content 
(estimated DOFS) above the ``ClusteringThreshold``. If no threshold is set, the algorithm 
uses the default threshold of ``total_estimated_DOFs / NumberOfElements``. Increasing the 
``ClusteringThreshold`` allows you to reduce the size distribution of state vector elements. If you 
specify a ``MaxClusterSize``, the algorithm attempts to keep the maximum cluster size close to the 
specified value. Note: due to the inner workings of k-means, the actual maximum cluster size may be 
slightly higher or lower than the specified value. If ``MaxClusterSize`` is not specified, the 
algorithm uses ``64`` as the default value.
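
As a rough illustration of the default threshold, the sketch below evaluates 
``total_estimated_DOFs / NumberOfElements`` for made-up numbers. The function name 
``default_clustering_threshold`` and the DOFS value are hypothetical and are not taken from the 
IMI source code.

.. code-block:: python

    def default_clustering_threshold(total_estimated_dofs, number_of_elements):
        """Return the fallback threshold: total estimated DOFS split evenly across elements."""
        if number_of_elements <= 0:
            raise ValueError("number_of_elements must be positive")
        return total_estimated_dofs / number_of_elements

    # Hypothetical values: 15.6 total estimated DOFS and 39 requested elements.
    threshold = default_clustering_threshold(15.6, 39)
    print(round(threshold, 3))

Setting ``ClusteringThreshold`` explicitly in the config file overrides this fallback.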

The information content of a cluster is generated by aggregating the estimated averaging kernel sensitivity 
values for the native resolution grid until a cluster reaches the provided ``ClusteringThreshold``. 
The algorithm starts with the smallest clusters possible and iteratively aggregates clusters until they
reach the threshold.
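
The idea of aggregating until a threshold is reached can be sketched as follows. This is a 
deliberately simplified illustration and not the IMI implementation: it assumes that aggregating 
means summing the sensitivities, it ignores spatial proximity (which the k-means step handles), 
and the function name ``aggregate_by_information`` and the sensitivity values are hypothetical.

.. code-block:: python

    def aggregate_by_information(sensitivities, threshold, max_cluster_size=64):
        """Greedily group cell indices until each group's summed sensitivity
        reaches the threshold or the group is full."""
        # Visit cells from most to least informative.
        order = sorted(range(len(sensitivities)),
                       key=lambda i: sensitivities[i], reverse=True)
        clusters, current, total = [], [], 0.0
        for idx in order:
            current.append(idx)
            total += sensitivities[idx]
            # Close the cluster once it carries enough information or is full.
            if total >= threshold or len(current) >= max_cluster_size:
                clusters.append(current)
                current, total = [], 0.0
        if current:
            # Remaining low-information cells form one final cluster.
            clusters.append(current)
        return clusters

    # Hypothetical averaging kernel sensitivities for six native resolution cells.
    cells = [0.5, 0.05, 0.35, 0.1, 0.02, 0.45]
    clusters = aggregate_by_information(cells, threshold=0.4)
    print(clusters)

In this toy case, the two most informative cells stay at native resolution while the remaining 
cells are merged into larger clusters.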

The ``ClusteringMethod`` specifies which clustering method to use for state vector reduction. Currently, 
``kmeans`` and ``mini-batch-kmeans`` are valid options. ``mini-batch-kmeans`` is very similar to ``kmeans`` 
but can be less accurate. It is best used for very large state vectors to speed up state vector reduction.

Note: The IMI preserves the original state vector file as ``NativeStateVector.nc`` in your run directory.

Incorporating point source information
--------------------------------------

If you have prior information about specific locations where you would like to maintain high 
resolution (e.g., point source detections), you can ensure the clustering algorithm preserves these 
locations by using the ``ForcedNativeResolutionElements`` config variable. This variable takes a list 
of lat/lon locations, given either as a YAML list or as a path to a CSV file.

For instance, if you suspect a location to be an emission hotspot, you can specify its 
lat/lon coordinates as in the examples below, and the clustering algorithm will ensure that the 
native resolution element is preserved during the aggregation. For the IMI to 
preserve the element, ``NumberOfElements`` must be large enough to accommodate the 
number of grid cells you would like to keep at native resolution.

Additionally, the ``PointSourceDatasets`` config variable can be used to automatically scrape emission 
hotspots from external point source datasets. Currently, the supported datasets are the ``"SRON"`` 
`weekly plumes dataset <https://earth.sron.nl/methane-emissions/>`_, ``"CarbonMapper"``, and ``"IMEO"``.

YAML list example:
::
    
    PointSourceDatasets: ["SRON"]
    ForcedNativeResolutionElements:
      - [31.5, -104]
      - [32.5, -103.5]

CSV file example:
::
    
    PointSourceDatasets: ["SRON"]
    ForcedNativeResolutionElements: "/path/to/point_source_locations.csv"

The CSV file should have a header row with the lowercase column names ``lat`` and ``lon``. 
The CSV file can have additional columns, but they will be ignored.
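
The expected file layout can be checked with a short script before starting a run. The sketch 
below is only an illustration of the format described above, not how the IMI itself reads the 
file. The function name ``read_point_sources``, the ``source`` column, and the 
``number_of_elements`` value are hypothetical.

.. code-block:: python

    import csv
    import io

    def read_point_sources(csv_text):
        """Parse (lat, lon) pairs from CSV text, ignoring any extra columns."""
        reader = csv.DictReader(io.StringIO(csv_text))
        if reader.fieldnames is None or not {"lat", "lon"} <= set(reader.fieldnames):
            raise ValueError("CSV needs a header row with lowercase 'lat' and 'lon' columns")
        return [(float(row["lat"]), float(row["lon"])) for row in reader]

    # Hypothetical file contents; the extra 'source' column is ignored.
    example_csv = """lat,lon,source
    31.5,-104,site_a
    32.5,-103.5,site_b
    """.replace("    ", "")
    locations = read_point_sources(example_csv)
    print(locations)

    # The forced cells must fit within the requested element count (assumed value).
    number_of_elements = 39
    assert len(locations) <= number_of_elements

To check a real file, pass its contents to the function, for example with 
``read_point_sources(open(path).read())``.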

Dynamic Kalman Filter clustering
--------------------------------
When running the IMI in Kalman Filter mode, users can dynamically adjust clusters at each Kalman iteration 
to best reflect the available information content by setting the ``DynamicKFClustering`` variable to 
``true``. See the `Kalman Filter IMI <../advanced/kalman-filter-mode.html>`__ documentation for more details.

IMI clustering scheme
---------------------
The IMI clustering algorithm uses a k-means based method similar to that described in 
`Nesser et al., 2021 <https://doi.org/10.5194/amt-14-5521-2021>`_ to maintain native 
resolution in areas with high information content (high prior emissions, high observation 
density), while aggregating cells with low information content.

Reducing computational cost while maintaining inversion quality
---------------------------------------------------------------
While clustering is an effective way to alleviate the computational constraints of 
running high-resolution inversions over large regions, it can introduce aggregation error 
and degrade the quality of your inversion 
(`Turner and Jacob, 2015 <https://doi.org/10.5194/acp-15-7039-2015>`_). 
Therefore, it is important to weigh the computational benefits of reducing your state vector 
against the loss of inversion quality. This can be done by iteratively tuning ``NumberOfElements`` 
and running the `IMI preview <../getting-started/imi-preview.html>`__ to assess 
the estimated DOFS. Ideally, you should find a middle ground where the estimated DOFS and 
computational cost are both at an acceptable level before proceeding with the inversion.
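
One way to organize this tuning is to record the estimated DOFS reported by the preview for each 
``NumberOfElements`` you try and then pick the smallest value that retains an acceptable share of 
the information. The sketch below assumes a hypothetical set of preview results and an arbitrary 
90% retention criterion. Neither the numbers nor the function ``pick_number_of_elements`` come 
from the IMI.

.. code-block:: python

    def pick_number_of_elements(preview_results, native_dofs, min_fraction=0.9):
        """Return the smallest element count whose estimated DOFS keeps at least
        min_fraction of the native resolution DOFS, or None if none qualifies."""
        for n_elements, dofs in sorted(preview_results.items()):
            if dofs >= min_fraction * native_dofs:
                return n_elements
        return None

    # Hypothetical preview results: NumberOfElements -> estimated DOFS.
    preview_results = {20: 9.8, 39: 13.1, 80: 14.6, 160: 15.2}
    choice = pick_number_of_elements(preview_results, native_dofs=15.6, min_fraction=0.9)
    print(choice)

The retention fraction is a judgment call that depends on your budget and on how much 
aggregation error you can tolerate.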