K-Means Algorithm to Detect the Current Market Regime

K-Means is one of the most popular clustering algorithms in machine learning. It’s the Oyster Perpetual of Rolex watches: the entry-level model. Good? Absolutely. Are there better ones? You bet.

K-Means is a machine learning algorithm used for clustering. Given a set of data points, it partitions them into groups (clusters) based on similarity, so that points in the same cluster are close to each other and far from points in other clusters. The “means” in the name refers to the cluster averages (centroids): each data point is assigned to the cluster whose centroid is nearest.
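To make the assign-then-average idea concrete, here is a minimal NumPy sketch of the classic iteration (often called Lloyd's algorithm). The function name `kmeans`, the two synthetic blobs and all parameter values are made up for illustration; a library implementation would normally be used in practice.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain K-Means: assign each point to its nearest mean, then recompute the means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of the points assigned to it (keep the old one if empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (illustrative data only).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans(points, k=2)
print(centroids)
```

On data this clean, each blob should end up in its own cluster. Market data is far messier, which is where the difficulties discussed later come from.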

So our idea is:

  • Step 1: Train the algorithm to discover different states of indicators corresponding to different clusters.
  • Step 2: Determine which market regime these clusters belong to.
  • Step 3: Buy only when the current state of the indicators belongs to a cluster that describes a bull market. Sell when it belongs to a cluster describing a neutral, bear, or any other market regime we might discover.

To demonstrate the K-Means algorithm, let’s generate a sine wave with some noise and run K-Means on it. Fig. 1 shows the result and the idea behind the algorithm.
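A demonstration along these lines could look like the sketch below, using scikit-learn's `KMeans`. I am assuming the clustering is done on the wave's value alone (one feature); the noise level, the number of points and the seeds are arbitrary choices, not the settings behind Fig. 1.

```python
import numpy as np
from sklearn.cluster import KMeans

# Noisy sine wave (all values here are illustrative assumptions).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 400)
wave = np.sin(t) + rng.normal(scale=0.1, size=t.size)

# K-Means expects a 2-D array: one row per observation, one column per feature.
X = wave.reshape(-1, 1)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

labels = model.labels_                    # cluster index for every point
centers = model.cluster_centers_.ravel()  # one mean per cluster
print("cluster means:", centers)
```

With two clusters, the algorithm effectively splits the wave into an "upper" and a "lower" half, which is the kind of state separation we want from market indicators.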

One very important thing to consider is the number of clusters we want the algorithm to discover. For example, if we train the algorithm to find 2 clusters, it delivers a chart like the one shown in Fig. 2 – Clusters. So how do we know the best number of clusters? For this, the elbow method is used. We re-train K-Means in a loop, each time with a different number of clusters, and calculate the within-cluster sum of squares (WCSS) for each result; the smaller the WCSS, the closer the points in each cluster are to their cluster mean. Then we plot WCSS against the number of clusters and pick the number beyond which the values stop improving much: we look for the elbow. For this sine wave, we would choose 2 as the optimal number of clusters.
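The elbow loop can be sketched as follows. In scikit-learn the WCSS of a fitted model is exposed as the `inertia_` attribute; the range of 1 to 8 clusters and the synthetic sine data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same kind of noisy sine wave as in the demonstration (illustrative values).
rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 400)
X = (np.sin(t) + rng.normal(scale=0.1, size=t.size)).reshape(-1, 1)

cluster_counts = list(range(1, 9))
wcss = []
for k in cluster_counts:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

for k, value in zip(cluster_counts, wcss):
    print(f"k={k}: WCSS={value:.1f}")
```

Plotting `wcss` against `cluster_counts` (for example with matplotlib) gives the elbow chart; the point where the curve flattens is the candidate number of clusters.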

Following the process above, plus a few extra steps, I will apply K-Means to determine different market regimes by clustering 6 indicators which, at first glance, can help with this task. These are:

  • Levels: RSI
  • Volatility: ATR, VIX
  • Volumes: Volume above its 50-period moving average
  • Bonds: Yield Curve
  • Direction: Price Change
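For reference, here is one way such a feature table might be assembled with pandas. Everything specific in this sketch is an assumption: the column names (`Close`, `High`, `Low`, `Volume`, `VIX`, `YieldCurve`), the 14-bar RSI and ATR windows, the simple-average RSI variant, and the synthetic prices. VIX and the yield curve are external series, so they are simply passed through.

```python
import numpy as np
import pandas as pd

def build_features(df, rsi_window=14, atr_window=14, vol_window=50):
    """Return one row of indicator values per bar; warm-up rows are dropped."""
    close = df["Close"]

    # RSI from simple rolling averages of gains and losses.
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(rsi_window).mean()
    avg_loss = (-delta.clip(upper=0)).rolling(rsi_window).mean()
    rsi = 100 - 100 / (1 + avg_gain / avg_loss)

    # ATR as the rolling mean of the true range.
    prev_close = close.shift(1)
    true_range = pd.concat(
        [df["High"] - df["Low"], (df["High"] - prev_close).abs(), (df["Low"] - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    atr = true_range.rolling(atr_window).mean()

    # 1.0 when volume is above its moving average, 0.0 otherwise, NaN during warm-up.
    vol_ma = df["Volume"].rolling(vol_window).mean()
    vol_above_ma = (df["Volume"] > vol_ma).astype(float).where(vol_ma.notna())

    features = pd.DataFrame({
        "rsi": rsi,
        "atr": atr,
        "vix": df["VIX"],
        "vol_above_ma": vol_above_ma,
        "yield_curve": df["YieldCurve"],
        "price_change": close.pct_change(),
    })
    return features.dropna()

# Synthetic stand-in data, only to show the function running.
rng = np.random.default_rng(1)
n = 300
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, n)))
df = pd.DataFrame({
    "Close": close,
    "High": close * 1.01,
    "Low": close * 0.99,
    "Volume": rng.integers(1_000, 5_000, n).astype(float),
    "VIX": rng.uniform(10, 40, n),
    "YieldCurve": rng.normal(1.0, 0.5, n),
})
features = build_features(df)
print(features.tail())
```

The resulting table has one column per indicator, which is the shape K-Means expects as input.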

The indicators are plotted together with SPY in Fig. 3. For our algorithm, I will use two datasets: training data from 2000 to 2015 and testing data from 2015 to 2023.

Fig. 3 – SPY & the Indicators

The natural next step is to determine the optimal number of clusters for K-Means using the elbow method. As shown in Fig. 4, there is no clear elbow, so it is difficult to choose an optimal number of clusters. Additionally, the WCSS is very high. In the sine wave demonstration, WCSS was already at 200 with 2 clusters and kept dropping toward 0. The market is not that simple, and our indicators do not explain the market’s movements; they react to them. By fitting K-Means on top of reactive indicators (though we could argue about the yield curve), we will never reach a high level of certainty. Let’s choose 6 as our number of clusters, as a balance between accuracy and the likelihood of overfitting.

Fig. 4 – The Elbow Method

After choosing the number of clusters, K-Means can be trained on the 2000-2015 data. It uncovers 6 states (clusters) of the market, each with its own distribution of the indicators (see Fig. 6). However, the market can only have 3 states: bull, neutral and bear. To map the clusters to the corresponding market state we could use the DTW algorithm, but with only 6 clusters we can simply sum the returns for each cluster and chart them. It is then easy to identify visually which regime each state corresponds to and chart it with corresponding colors; see Fig. 5.
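Summing returns per cluster is a one-line group-by in pandas. The sketch below uses random returns and random state labels purely as placeholders; in the real workflow `returns` would be SPY's daily returns and `states` the cluster labels from the trained model.

```python
import numpy as np
import pandas as pd

# Placeholder inputs (random, for illustration only).
rng = np.random.default_rng(2)
returns = pd.Series(rng.normal(0.0003, 0.01, 1000), name="ret")
states = pd.Series(rng.integers(0, 6, 1000), name="state")

# Total return and number of days spent in each state, best state first.
summary = returns.groupby(states).agg(total_return="sum", days="count")
summary = summary.sort_values("total_return", ascending=False)
print(summary)
```

States with a clearly positive total can then be labelled bull, clearly negative ones bear, and the rest judged by eye, which is what the bar chart of returns by state is for.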

Fig. 5 – Returns By State

By mapping States 0, 1 and 3 as a bull market, States 2 and 4 as bear and State 5 as neutral, we can paint our chart for a visual inspection. Fig. 6 shows the raw clusters and then the mapped regimes. We can clearly see the market regimes; however, K-Means did not identify a neutral market: what we labelled “Neutral” actually corresponds to the bottom of the 2008 crisis. Ideally, we should buy the most during State 5.

Next, we run the trained model’s predict function on the test data (2015-2023) and apply the same state mapping to the predictions; see Fig. 7.
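A train-then-predict workflow of this kind might be sketched as below with scikit-learn. Several things here are my assumptions rather than details from the original setup: the exact split date, the random placeholder features, and in particular the `StandardScaler` step, which I add because K-Means is distance-based and indicators on different scales would otherwise dominate each other.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder feature table: 6 random columns on business days (illustrative only).
rng = np.random.default_rng(3)
dates = pd.bdate_range("2000-01-03", "2023-12-29")
columns = ["rsi", "atr", "vix", "vol_above_ma", "yield_curve", "price_change"]
features = pd.DataFrame(rng.normal(size=(len(dates), 6)), index=dates, columns=columns)

# Assumed boundary: train through the end of 2014, test from 2015 onward.
train = features.loc[:"2014-12-31"]
test = features.loc["2015-01-01":]

# Fit the scaler and the clusters on training data only, then reuse both on the test set.
scaler = StandardScaler().fit(train)
model = KMeans(n_clusters=6, n_init=10, random_state=0).fit(scaler.transform(train))

train_states = pd.Series(model.labels_, index=train.index)
test_states = pd.Series(model.predict(scaler.transform(test)), index=test.index)
print(test_states.value_counts().sort_index())
```

Because nothing is re-fitted on the test period, the state numbers keep the meaning they were given on the training data, so the same state-to-regime mapping can be applied.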

Fig. 7 – SPY chart with Training & Testing clusters shown

After this step is complete, the backtest can be made. Let’s define some rules and then run a backtest (see Fig. 8):

  • Have 100% invested during States 0, 1 and 3
  • Have 50% invested during States 2 and 4
  • Have 200% invested during State 5
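The rules above can be sketched as a tiny vectorized backtest in pandas. The function name `backtest` and the four-day toy series are made up for illustration. Two simplifying assumptions to note: the position for each day uses the previous day's state (to avoid look-ahead), and trading costs and the financing cost of the 200% exposure are ignored.

```python
import pandas as pd

# Target exposure per state, taken from the rules above.
EXPOSURE_BY_STATE = {0: 1.0, 1: 1.0, 3: 1.0, 2: 0.5, 4: 0.5, 5: 2.0}

def backtest(returns, states, exposure_by_state=EXPOSURE_BY_STATE):
    """Scale each day's return by the exposure implied by the previous day's state."""
    # Shift by one bar so today's position only uses yesterday's state (assumption).
    exposure = states.map(exposure_by_state).shift(1).fillna(0.0)
    strategy_returns = exposure * returns
    return pd.DataFrame({
        "exposure": exposure,
        "strategy": (1 + strategy_returns).cumprod(),
        "buy_and_hold": (1 + returns).cumprod(),
    })

# Toy inputs, just to show the mechanics.
returns = pd.Series([0.01, 0.02, -0.01, 0.03])
states = pd.Series([0, 5, 2, 1])
result = backtest(returns, states)
print(result)
```

In the real test, `returns` would be SPY's daily returns over 2015-2023 and `states` the predicted clusters for the same dates.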

Fig. 8 – Backtest

It looks like K-Means would have correctly identified market turning points. State 5 is rare, but it pays. Notice, however, that during normal markets the algorithm underperforms, so its edge is visible only about 1% of the time. So, let’s get to the downsides of K-Means.

Negatives

  • K-Means is extremely sensitive to the number of clusters. Add or remove one cluster and the whole algorithm outputs different results
  • Additionally, K-Means is initialized from randomly chosen points, from which the algorithm iterates towards the centroids. The problem is that with complex data (such as market data) there can be many local optima, so depending on the initial positions, K-Means will converge to different clusters
  • Susceptible to regime change. Since the model is trained on past data and then applied to future values, everything is essentially based on the past: no predictive value
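The initialization issue is easy to observe: fit the same data several times with a single random start each and compare the results. This sketch uses random placeholder data and scikit-learn's `init="random"` with `n_init=1`; the seeds and sizes are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for 6 scaled indicators (random, illustrative only).
rng = np.random.default_rng(4)
X = rng.normal(size=(1500, 6))

inertias = []
for seed in range(5):
    # One random start per fit, so each seed can land in a different local optimum.
    model = KMeans(n_clusters=6, init="random", n_init=1, random_state=seed).fit(X)
    inertias.append(model.inertia_)
    sizes = np.bincount(model.labels_, minlength=6)
    print(f"seed={seed}: WCSS={model.inertia_:.1f}, cluster sizes={sizes}")
```

If the WCSS and cluster sizes differ from seed to seed, the fits have converged to different solutions. Raising `n_init` makes scikit-learn keep the best of several starts, which reduces the problem but does not remove it.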

Positives

  • Easy to interpret and understand
  • Can support a huge number of indicators and features
  • Great at finding extremes

To conclude, K-Means is a wonderful entry-level machine learning algorithm that can be used to discover states of the market, states of the indicators, stocks and more. It is, however, a highly unstable model, and one that relies on the data being stationary.

A future improvement for this algorithm would be to initialize it with predetermined centroids, so that the model is stable. Additionally, we should try a rolling approach in which we re-train the model on the latest data (for example, the last 10 years) to account for regime change.
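As a rough sketch of the first idea, scikit-learn's `KMeans` accepts an array of starting centroids through its `init` parameter. The centroid values, the helper name `fit_with_fixed_init` and the three synthetic blobs below are hypothetical; how to choose good starting centroids for market data is left open.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs along the diagonal (illustrative data only).
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-3.0, 0.0, 3.0)])

# Hypothetical predetermined starting centroids, one row per cluster.
initial_centroids = np.array([[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]])

def fit_with_fixed_init(data, centroids):
    """Fit K-Means from explicit starting centroids instead of random ones."""
    return KMeans(n_clusters=len(centroids), init=centroids, n_init=1).fit(data)

first = fit_with_fixed_init(X, initial_centroids)
second = fit_with_fixed_init(X, initial_centroids)
print("identical labels:", np.array_equal(first.labels_, second.labels_))
```

With a fixed start, repeated fits on the same data give the same clusters in the same order. Reusing the same starting centroids on each rolling re-train may also help keep state numbers comparable between windows, though that is not guaranteed if the data shifts a lot.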

