Analyzing hysteresis loops using the General T-Q SOM#

This tutorial will demonstrate a basic workflow for analyzing suspended sediment transport hysteresis loops using the pre-trained General T-Q SOM. For details on how this SOM was trained, check out our research paper.
The workflow will be illustrated using a time series of discharge and turbidity for which a set of hydrologic events where previously delineated.

Beyond HySOM (and its dependencies: numpy, matplotlib and numba), Pandas is required to run this notebook, so make sure you have it installed.

from hysom.utils.datasets import get_01191000_qt_data, get_01191000_events_data
from hysom.utils.plots import heat_map, heatmap_frequency
from hysom.pretrainedSOM import get_generalTQSOM
from dataclasses import dataclass, field
import matplotlib.pyplot as plt
from datetime import datetime
import numpy.typing as npt 
from typing import Any 
import pandas as pd
import numpy as np

Load data#

First, let’s load a sample dataset included in the HySOM package. The dataset contains two main pices of data:

qt: Pandas dataframe with time series data of discharge and turbidity
event_times: list of (start,end) times for each event

qt = pd.DataFrame(get_01191000_qt_data())
qt.set_index("datetime", inplace=True, drop=True)
event_times = get_01191000_events_data()

print(f"{len(event_times)} events. First 5:")
print(*event_times[:5], sep= "\n")
print("\nTime Series:")
qt.head()

253 events. First 5:
(datetime.datetime(2016, 6, 4, 8, 8, 46, tzinfo=datetime.timezone.utc), datetime.datetime(2016, 6, 9, 6, 2, 42, tzinfo=datetime.timezone.utc))
(datetime.datetime(2016, 6, 9, 19, 8, 42, tzinfo=datetime.timezone.utc), datetime.datetime(2016, 6, 14, 17, 2, 39, tzinfo=datetime.timezone.utc))
(datetime.datetime(2016, 6, 29, 8, 33, 29, tzinfo=datetime.timezone.utc), datetime.datetime(2016, 7, 4, 6, 27, 25, tzinfo=datetime.timezone.utc))
(datetime.datetime(2016, 7, 6, 15, 57, 4, tzinfo=datetime.timezone.utc), datetime.datetime(2016, 7, 9, 9, 27, 2, tzinfo=datetime.timezone.utc))
(datetime.datetime(2016, 7, 9, 11, 16, 12, tzinfo=datetime.timezone.utc), datetime.datetime(2016, 7, 11, 20, 34, 55, tzinfo=datetime.timezone.utc))

Time Series:

	Qcms	turb
datetime
2016-05-31 21:15:00+00:00	0.387941	2.4
2016-05-31 21:30:00+00:00	0.387941	2.5
2016-05-31 21:45:00+00:00	0.370951	2.3
2016-05-31 22:00:00+00:00	0.370951	2.3
2016-05-31 22:15:00+00:00	0.370951	2.3

Plot data:#

fig, (axq, axt) = plt.subplots(2,1, sharex = True, figsize = (15,4))

axq.plot(qt["Qcms"])
axq.set_ylabel("Discharge (cms)")
axt.plot(qt["turb"])
axt.set_ylabel("Turbidity (NTU)")

for start, end in event_times:
    for ax in [axq, axt]:
        ax.axvspan(xmin = start, xmax = end, alpha = 0.3, color = "grey")

../_images/80419f0d776407d19ba5202c442cc56869dbdfb9fb32bfb81688bd7d305b389c.png

Preprocess data#

Let’s prepare the data to make it compatible with HSOM. For a detailed explanation of the preprocessing steps check out this tutorial

To facilitate the analysis, I’ll first define a helper function to interpolate C-Q data (loop_interpolation) and a data structure to easily store and access event information including starting and end dates, the corresponding hysteresis loop and additional attributes (Event):

def loop_interpolation(CQtimeSeries: pd.DataFrame, seq_length) -> np.ndarray:
    """
    Interpolates a hysteresis loop in the C-Q plane to make it compatible with `HSOM`. 
    The resulting loop will have `seq_length` evenly spaced data points.
    Note that the interpolation procedure ignores time information

    Parameters:
    - CQtimeSeries (pd.DataFrame): Pandas DataFrame with two columns. Typically: (discharge, concentration)
    - seq-length (int): number  
    
    Returns:
    - np.ndarray: Numpy array of shape seq_length x 2 with the interpolated sequence

    """
    col1, col2 = CQtimeSeries.columns
    accum_dists = ( 
        (CQtimeSeries[col1] - CQtimeSeries.shift(1)[col1])**2 + (CQtimeSeries[col2] - CQtimeSeries.shift(1)[col2])**2
        ).apply(np.sqrt).cumsum()
    accum_dists.iloc[0] = 0.0
    path_length = accum_dists.max()
    
    interp_dists = np.linspace(0,path_length, seq_length)
    
    col1_interp = np.interp(interp_dists, accum_dists, CQtimeSeries[col1])
    col2_interp = np.interp(interp_dists, accum_dists, CQtimeSeries[col2])
    
    return np.stack((col1_interp, col2_interp), axis = 1)

@dataclass
class Event:
    start: datetime
    end: datetime
    loop: npt.ArrayLike = field(repr=False)
    attributes: dict[str, Any] = field(default_factory=dict, repr=False)

    def get_attribute(self, key: str, default: Any = None):
        return self.attributes.get(key, default)

    def set_attribute(self, key: str, val: Any):
        self.attributes[key] = val

Extract loops and some atrributes (maximum discharge and turbidity) for each event and store them in a list:

loops_and_metrics = []
events = []
for start, end in event_times:
    qtevent = qt[start:end]
    qtnormalized = (qtevent - qtevent.min()) / (qtevent.max() - qtevent.min())
    loop = loop_interpolation(qtnormalized, 100)
    qmax = qtevent["Qcms"].max()
    turbmax = qtevent["turb"].max()
    events.append(Event(start = start, end = end, loop=loop, attributes={"qmax":qmax, "turbmax":turbmax}))

Let’s visualize some loops

fig, axs = plt.subplots(6,5, figsize = (5,6))

for event, ax in zip(events[:30], axs.flatten()):
    loop = event.loop
    ax.scatter(loop[:,0], loop[:,1], cmap = "inferno", c = range(len(loop)), s = 2)
    ax.tick_params(labelbottom = False, labelleft = False)

../_images/83b437dd5f37ca975012f2f291be408912d3724ece60db0e7a428964e67bd935.png

Analyze the data with the General T-Q SOM#

Let’s load the General T-Q SOM first#

TQsom = get_generalTQSOM()

Plot the frequency distribution including all loops:

heatmap_frequency(TQsom, [ev.loop for ev in events])

../_images/ef7dbf9b22b359dd2fdf9804297c88403b0f51b863b44193a5d428815ef1d43f.png

Now, let’s add a couple more attributes to each event: the bmu and the distance_to_bmu as measured by the distance function (in this case: DTW, See more details in our research paper and SI)

for ev in events:
    ev.set_attribute("bmu",TQsom.get_BMU(ev.loop))
    ev.set_attribute("distance_to_bmu",TQsom.get_distance_to_bmu(ev.loop))

See below the frequency distribution of distance_to_bmu values. Note that this distance measures the similarity between a loop and its associated BMU, with larger distances indicating less similarity. Loops that are too different from their BMU should be filter out to reduce noise and misleading results.

_ = plt.hist([ev.get_attribute("distance_to_bmu") for ev in events])

../_images/f09e7ede6d6228f6120cbb3f5133f9209e8779408292ad5ca59d6c448e34cdfc.png

Let’s use a threshold of 1.5 as an acceptable value (this is a rough estimate. In a real analysis you want to find a value that works for you)

filtered_events = list(filter(lambda x: x.get_attribute("distance_to_bmu") < 1.5, events))

print(f"{len(filtered_events)} valid events")

197 valid events

See below the frequency distribution with the filtered data:

heatmap_frequency(TQsom, [ev.loop for ev in filtered_events], coordinates_style="matrix")

../_images/cf8289660b5b967e90b804f77645f2d08e0f5344067feb4186ec03ff159ca21a.png

Let’s plot now a heat map to visualize the association between loop types and discharge and tubidity:

loops,qmax,turbmax = zip(*[(ev.loop, ev.get_attribute("qmax"), ev.get_attribute("turbmax")) for ev in filtered_events ])

heat_map(TQsom, loops, qmax, colorbar_label="Median of Max discharge", agg_method=np.median)

../_images/21d96dcf2c175cd1f7f67f699232bdafb9415aa79f704eba3e2b82ec135b682b.png

heat_map(TQsom, loops, turbmax, colorbar_label="Median of Max turbidity", agg_method=np.median)

../_images/8da8d707d34e6d3fb47011413322b4f0ab13c24fef21fb95737f6b7dce23c426.png

You can also extract the frequency matrix and attribute matrices shown in the heatmaps:

TQsom.frequency_matrix([ev.loop for ev in filtered_events])

array([[ 6.,  1.,  1.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  1.,  0.,  1.,  0.,  0.,  0.,  0.],
       [11.,  2.,  1.,  0.,  1.,  0.,  1.,  0.],
       [ 5.,  2.,  1.,  1.,  0.,  0.,  0.,  0.],
       [ 6.,  6.,  2.,  0.,  1.,  0.,  0.,  0.],
       [41., 12.,  2.,  1.,  0.,  0.,  0.,  0.],
       [ 9., 26.,  2.,  3.,  0.,  0.,  0.,  0.],
       [15., 19., 15.,  1.,  0.,  0.,  0.,  0.]])

TQsom.attribute_matrix([ev.loop for ev in filtered_events], [ev.get_attribute("qmax") for ev in filtered_events]).round(2) # round to 2 decimals for visualization only

array([[3.04, 0.81, 0.21,  nan,  nan, 1.13,  nan,  nan],
       [ nan, 2.15,  nan, 2.41,  nan,  nan,  nan,  nan],
       [2.23, 1.27, 1.93,  nan, 1.6 ,  nan, 2.52,  nan],
       [1.79, 0.93, 2.05, 2.45,  nan,  nan,  nan,  nan],
       [4.4 , 3.09, 1.17,  nan, 0.43,  nan,  nan,  nan],
       [5.52, 5.47, 2.1 , 0.5 ,  nan,  nan,  nan,  nan],
       [2.33, 9.37, 1.72, 1.6 ,  nan,  nan,  nan,  nan],
       [5.44, 9.6 , 9.03, 5.64,  nan,  nan,  nan,  nan]])

TQsom.attribute_matrix([ev.loop for ev in filtered_events], [ev.get_attribute("turbmax") for ev in filtered_events]).round(2) # round to 2 decimals for visualization only

array([[437.  ,  72.7 , 110.  ,    nan,    nan, 384.  ,    nan,    nan],
       [   nan,  38.8 ,    nan, 265.  ,    nan,    nan,    nan,    nan],
       [ 78.4 ,  34.7 ,  37.5 ,    nan,  83.6 ,    nan, 285.  ,    nan],
       [ 26.  ,  36.25,  11.4 ,  38.7 ,    nan,    nan,    nan,    nan],
       [141.5 ,  29.55,  14.5 ,    nan,  12.2 ,    nan,    nan,    nan],
       [168.  ,  71.75,  26.85,  14.9 ,    nan,    nan,    nan,    nan],
       [146.  , 155.  ,  15.9 ,  30.7 ,    nan,    nan,    nan,    nan],
       [497.  , 126.  ,  92.5 ,  29.7 ,    nan,    nan,    nan,    nan]])