Nazli Ander

Writing Custom Cross-Validation Methods For Grid Search in Scikit-learn

data-science, machine-learning, python · 2 min read

Recently I wanted to apply a Blocking Time Series Split, following this lovely post, in a Grid Search hyper-parameter tuning setting with the scikit-learn library, to maintain the time order and prevent information leakage. In this post, I will try to document what I learned while reading through the articles, documentation, and blog posts about custom cross-validation generators in Python.

It is great that scikit-learn provides a class called TimeSeriesSplit, which generates training and test sets over fixed-size intervals. Here is a basic example using scikit-learn data generators. I generate a regression dataset with 5 features and 30 samples. Then I generate 3 splits. For each of those 3 splits, we obtain at most 10 training examples (max_train_size=10) and n_samples//(n_splits + 1) = 30//4 = 7 test examples:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import TimeSeriesSplit

X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

tscv = TimeSeriesSplit(max_train_size=10, n_splits=3)

for idx, (x, y) in enumerate(tscv.split(X_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")

The output follows a Walk Forward Cross Validation pattern. The first split has only 9 training examples because only 9 samples precede its test set:

Split number: 0
Training indices: [0 1 2 3 4 5 6 7 8]
Test indices: [ 9 10 11 12 13 14 15]

Split number: 1
Training indices: [ 6 7 8 9 10 11 12 13 14 15]
Test indices: [16 17 18 19 20 21 22]

Split number: 2
Training indices: [13 14 15 16 17 18 19 20 21 22]
Test indices: [23 24 25 26 27 28 29]
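
The sizes above can be checked by hand. The following is a minimal sketch of the index arithmetic as I understand it from the TimeSeriesSplit documentation, with the default gap of 0. The helper walk_forward_indices is hypothetical and written only for illustration; it is not scikit-learn's own code.

```python
import numpy as np


def walk_forward_indices(n_samples, n_splits, max_train_size=None):
    """Yield (train, test) index arrays for a walk-forward split.

    Hypothetical helper mirroring the documented behaviour of
    TimeSeriesSplit with gap=0; not the library implementation.
    """
    indices = np.arange(n_samples)
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_start = 0
        if max_train_size is not None:
            # Keep only the most recent max_train_size samples
            train_start = max(0, test_start - max_train_size)
        yield (indices[train_start:test_start],
               indices[test_start:test_start + test_size])


splits = list(walk_forward_indices(30, 3, max_train_size=10))
for train, test in splits:
    print(len(train), len(test))
```

With 30 samples and 3 splits, the first test block starts at index 9, which is why the first training set cannot reach the cap of 10.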

However, in my setting the data used dates instead of timestamps. This led to discrete numeric values as anchor points for the cross-validation splits, instead of continuous ones. Hence, I was not able to leverage TimeSeriesSplit from scikit-learn. Instead, I wrote a simple generator with groupings for date splits to use in Grid Search.

import numpy as np
import pandas as pd


class CustomCrossValidation:

    @classmethod
    def split(cls,
              X: pd.DataFrame,
              y: np.ndarray = None,
              groups: np.ndarray = None):
        """Yields (training, test) index arrays for a grouped time series split."""
        assert len(X) == len(groups), (
            "Length of the predictors is not "
            "matching with the groups.")
        # Group labels must be integers that increase with time
        for group_idx in range(groups.min(), groups.max()):

            training_group = group_idx
            # Gets the next group right after
            # the training as test
            test_group = group_idx + 1
            training_indices = np.where(
                groups == training_group)[0]
            test_indices = np.where(groups == test_group)[0]
            if len(test_indices) > 0:
                # Yield the training and test indices
                # for the cross-validation generator
                yield training_indices, test_indices

CustomCrossValidation is a simple class with one method (split) that uses X (predictors), y (target values), and groups corresponding to the date groups. Those can be months or quarters in your dataset; however, I assumed that they can be mapped to integers that keep the order of time. Hence, if I have 3 quarters in the dataset, the date values are Q1, Q2, and Q3, and I can simply map those to 0, 1, 2 to keep the order and use them in my validation generator class method.
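
The mapping from dates to ordered integers can be sketched with NumPy alone. The dates below are made up for illustration, and grouping by calendar quarter is only one possible choice:

```python
import numpy as np

# Hypothetical observation dates (made up for this sketch)
dates = np.array(
    ["2021-01-15", "2021-02-20", "2021-04-03", "2021-08-30", "2021-09-01"],
    dtype="datetime64[D]")

# Months elapsed since 1970-01, then three months per quarter
months = dates.astype("datetime64[M]").astype(np.int64)
quarters = months // 3

# np.unique returns the sorted unique quarters; the inverse gives
# each row the position of its quarter in that sorted list.
_, groups = np.unique(quarters, return_inverse=True)
print(groups)  # [0 0 1 2 2]
```

Note that this produces dense codes: a quarter without any rows does not leave a gap in the integers, so consecutive codes are not guaranteed to be consecutive calendar quarters.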

GridSearchCV in scikit-learn accepts, as its cv parameter, an iterable that yields pairs of training and test index arrays, and the generator returned by split is such an iterable (split is also the method name that scikit-learn's own splitter classes use). Here, I iterate over the range of integer groups to keep the order of dates. I assign the indices of one group (t) to be the training indices and those of the next group (t + 1) to be the validation indices. In the end, the method yields the training and test indices.
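
scikit-learn's built-in splitters are objects that expose both split and get_n_splits, and as far as I understand the documentation, GridSearchCV accepts such an object for cv as well. Below is a minimal sketch of that alternative. The class name GroupedWalkForwardSplit is hypothetical, and the block only exercises the class directly with NumPy rather than running a search.

```python
import numpy as np


class GroupedWalkForwardSplit:
    """Hypothetical splitter: train on one group, test on the next observed one."""

    def split(self, X, y=None, groups=None):
        groups = np.asarray(groups)
        labels = np.unique(groups)  # sorted unique group labels
        # Pairs consecutive observed labels, so a missing label
        # is skipped rather than producing an empty training set.
        for train_label, test_label in zip(labels[:-1], labels[1:]):
            yield (np.where(groups == train_label)[0],
                   np.where(groups == test_label)[0])

    def get_n_splits(self, X=None, y=None, groups=None):
        # One split per pair of consecutive group labels
        return max(len(np.unique(groups)) - 1, 0)


groups = np.array([0, 0, 1, 1, 1, 2, 2])
splitter = GroupedWalkForwardSplit()
for train_idx, test_idx in splitter.split(np.zeros((7, 2)), groups=groups):
    print(train_idx, test_idx)
print(splitter.get_n_splits(groups=groups))
```

An instance should then be usable as cv=GroupedWalkForwardSplit(), with groups supplied to fit. Unlike a generator, such an object is not exhausted after one pass.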

This example displays how the custom split works with the groups. To have date groups of different sizes, I created 4 groups with 5 instances of 0s, 10 instances of 1s, 10 instances of 2s, and 5 instances of 3s:

X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

groups_experiment = np.concatenate([np.zeros(5),  # 5 0s
                                    np.ones(10),  # 10 1s
                                    2 * np.ones(10),  # 10 2s
                                    3 * np.ones(5)  # 5 3s
                                    ]).astype(int)

for idx, (x, y) in enumerate(
        CustomCrossValidation.split(X_experiment,
                                    y_experiment,
                                    groups_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")

With the groupings, the example dataset looks like this:

# The first 5 predictor values...
          0         1         2         3         4
0 -0.566298  0.099651  2.190456 -0.503476 -0.990536
1  0.174578  0.257550  0.404051 -0.074446  1.886186
2  0.314247 -0.908024 -0.562288 -1.412304 -1.012831
3 -1.106335 -1.196207 -0.479174  0.812526 -0.185659
4 -0.013497 -1.057711 -0.601707  0.822545  1.852278

# The first 5 target values...
            0
0   73.398681
1  195.221637
2 -139.402678
3 -124.863423
4   94.753517

# Groupings for the example dataset...
# The 0s are older date anchor values, whereas 3s the newest...
[0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3]

The groups give the validation flow its order. First the 0s are used as the training set and the 1s as validation. Then the 1s are used as training and the 2s as validation, and so on. The indices generated for the example are:

Split number: 0
Training indices: [0 1 2 3 4]
Test indices: [ 5 6 7 8 9 10 11 12 13 14]

Split number: 1
Training indices: [ 5 6 7 8 9 10 11 12 13 14]
Test indices: [15 16 17 18 19 20 21 22 23 24]

Split number: 2
Training indices: [15 16 17 18 19 20 21 22 23 24]
Test indices: [25 26 27 28 29]
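
Since the goal is to avoid information leakage, it can be worth asserting that no training index is later than any test index. Here is a small sketch of such a check. It assumes the rows are stored in time order and rebuilds the example groups inline instead of reusing the class above:

```python
import numpy as np

groups = np.array([0] * 5 + [1] * 10 + [2] * 10 + [3] * 5)

# Rebuild the (train, test) pairs: group t for training, t + 1 for testing
splits = [
    (np.where(groups == g)[0], np.where(groups == g + 1)[0])
    for g in range(groups.min(), groups.max())
]

for train_idx, test_idx in splits:
    # Rows are assumed to be stored in time order, so a larger index
    # means a later observation.
    assert train_idx.max() < test_idx.min(), "training data overlaps the future"

print(len(splits), "splits, none trains on rows later than its test set")
```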

As an example setup, I will use Lasso Regression and try to optimize alpha with Grid Search. In Lasso, a larger alpha forces more coefficients to be 0. It is very common to search for the optimum value of alpha in a Lasso Regression.
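
To see why a larger alpha zeroes more coefficients, consider the textbook special case of an orthonormal design, where the Lasso solution is the least-squares coefficients shrunk by soft-thresholding (up to how the penalty is scaled). The sketch below uses made-up coefficient values and a hypothetical helper soft_threshold. It illustrates the shrinkage only, not how scikit-learn's solver works in general.

```python
import numpy as np


def soft_threshold(coefs, alpha):
    """Shrink each coefficient toward 0 by alpha, clipping at 0."""
    return np.sign(coefs) * np.maximum(np.abs(coefs) - alpha, 0.0)


# Made-up least-squares coefficients, for illustration only
ols_coefs = np.array([3.0, -1.5, 0.4, -0.05])

zero_counts = []
for alpha in [0.1, 1, 10]:
    shrunk = soft_threshold(ols_coefs, alpha)
    zero_counts.append(int(np.sum(shrunk == 0)))
    print(f"alpha={alpha}: {zero_counts[-1]} of {len(ols_coefs)} coefficients are 0")
```

With these values, 1, 2, and then all 4 coefficients become 0 as alpha grows from 0.1 to 10. The grid search setup itself follows.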

from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# Instantiating the Lasso estimator
reg_estimator = linear_model.Lasso()
# Parameters
parameters_to_search = {"alpha": [0.1, 1, 10]}
# Splitter
custom_splitter = CustomCrossValidation.split(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)

# Search setup
reg_search = GridSearchCV(
    estimator=reg_estimator,
    param_grid=parameters_to_search,
    scoring="neg_root_mean_squared_error",
    cv=custom_splitter)
# Fitting (fit returns the fitted search object itself)
best_model = reg_search.fit(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)

# Best estimator and number of splits used
print(reg_search.best_estimator_)
print(reg_search.n_splits_)

After fitting, the best estimator found with the custom cross-validation and the number of splits are as follows. There are 3 splits because we used 4 groups.

# Best model:
Lasso(alpha=0.1)

# Number of splits:
3
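
One caveat worth keeping in mind: split returns a generator, and a Python generator can be iterated only once. If the same custom_splitter object is reused, for example for a second search, it will likely be empty by then. Here is a minimal sketch with a stand-in generator function (hypothetical and simplified, NumPy only):

```python
import numpy as np


def toy_split(groups):
    """Stand-in for CustomCrossValidation.split (hypothetical, simplified)."""
    for g in range(groups.min(), groups.max()):
        yield np.where(groups == g)[0], np.where(groups == g + 1)[0]


groups = np.array([0, 0, 1, 1, 2, 2])

splitter = toy_split(groups)
first_pass = list(splitter)   # consumes the generator
second_pass = list(splitter)  # nothing left to yield
print(len(first_pass), len(second_pass))  # 2 0

# Materializing the splits once gives a reusable list of index pairs
reusable_splits = list(toy_split(groups))
```

A list of (train, test) pairs is also an iterable, so it should be usable as cv in the same way while surviving repeated use.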

Voilà, a simple generator was enough to get a custom validation flow in a Grid Search optimization. I enjoy reading the scikit-learn documentation. Besides the fact that reading is fun, it helps me understand some statistical implementations better and tweak them whenever necessary.

For a complete set of examples, please refer to the GitHub repository. Happy reading the documentation!

This post is also available on DEV.