# Writing Custom Cross-Validation Methods For Grid Search in Scikit-learn


Recently I was interested in applying Blocking Time Series Split, following this lovely post, in a Grid Search hyper-parameter tuning setting using the scikit-learn library, to maintain the time order and prevent information leakage. In this post, I will try to document some of the knowledge I built while reading through articles, documentation, and blog posts about custom cross-validation generators in Python.

It is great that scikit-learn provides a class called `TimeSeriesSplit`, and by using it we can generate fixed time interval training and test sets. Here is a basic example using scikit-learn data generators. I generate a regression dataset with 5 features and 30 samples. Then I generate 3 splits. For those 3 splits, we obtain at most 10 training examples (capped by `max_train_size`) and `n_samples // (n_splits + 1)` test examples, which is 7 here:

```
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import TimeSeriesSplit

X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

tscv = TimeSeriesSplit(max_train_size=10, n_splits=3)

for idx, (x, y) in enumerate(tscv.split(X_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")
```

The output follows a Walk Forward Cross Validation pattern:

```
Split number: 0
Training indices: [0 1 2 3 4 5 6 7 8]
Test indices: [ 9 10 11 12 13 14 15]

Split number: 1
Training indices: [ 6  7  8  9 10 11 12 13 14 15]
Test indices: [16 17 18 19 20 21 22]

Split number: 2
Training indices: [13 14 15 16 17 18 19 20 21 22]
Test indices: [23 24 25 26 27 28 29]
```
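To see where these indices come from, the arithmetic can be reproduced by hand. The sketch below is not scikit-learn's implementation; `walk_forward_indices` is a hypothetical helper that assumes the default settings (no gap between windows, no explicit test size) and only mirrors the numbers shown above.

```python
import numpy as np


def walk_forward_indices(n_samples, n_splits, max_train_size=None):
    """Yield (train, test) index arrays in a walk-forward pattern.

    Hypothetical helper for illustration; assumes no gap between the
    training and test windows.
    """
    test_size = n_samples // (n_splits + 1)  # 30 // 4 == 7 in the example
    indices = np.arange(n_samples)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # Keep only the most recent max_train_size samples, if capped
        train_start = 0
        if max_train_size is not None:
            train_start = max(0, test_start - max_train_size)
        yield (indices[train_start:test_start],
               indices[test_start:test_start + test_size])


for train_idx, test_idx in walk_forward_indices(30, 3, max_train_size=10):
    print(len(train_idx), len(test_idx))
```

With 30 samples and 3 splits, the first test window starts at index 9, so the first training window can only hold 9 samples even though the cap is 10.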

However, the setting I was working in used dates instead of timestamps. This led **to discrete numeric values** as anchor points for the cross-validation splits, **instead of continuous ones**. Hence, I was not able to leverage `TimeSeriesSplit` from scikit-learn. Instead, I wrote a simple generator object with groupings for date splits to use in Grid Search.

```
import numpy as np
import pandas as pd


class CustomCrossValidation:

    @classmethod
    def split(cls,
              X: pd.DataFrame,
              y: np.ndarray = None,
              groups: np.ndarray = None):
        """Returns a grouped time series split generator."""
        assert len(X) == len(groups), (
            "Length of the predictors is not "
            "matching with the groups.")
        # Group ids must be integers; walk through them
        # in increasing (time) order
        for group_idx in range(groups.min(), groups.max()):

            training_group = group_idx
            # Gets the next group right after
            # the training as test
            test_group = group_idx + 1
            training_indices = np.where(
                groups == training_group)[0]
            test_indices = np.where(groups == test_group)[0]
            if len(test_indices) > 0:
                # Yielding training and testing indices
                # for the cross-validation generator
                yield training_indices, test_indices
```

`CustomCrossValidation` is a simple class with one method (`split`) that takes X (predictors), y (target values), and groups corresponding to the date groups. Those can be months or quarters in your dataset; however, I assumed that they can be mapped to integers that keep the order of time. Hence, if I have 3 quarters in the dataset, I can first have `Q1`, `Q2`, and `Q3` as date values. But I can simply map those to 0, 1, 2 to keep the order and use them in my validation generator class method.
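A minimal sketch of that mapping, assuming the labels and their chronological order are known up front. The helper name `encode_ordered_groups` and the sample labels are made up for illustration.

```python
import numpy as np


def encode_ordered_groups(labels, ordered_labels):
    """Map date labels to integers that preserve chronological order.

    Hypothetical helper: `ordered_labels` must list every label once,
    from oldest to newest.
    """
    label_to_group = {label: i for i, label in enumerate(ordered_labels)}
    return np.array([label_to_group[label] for label in labels])


quarter_labels = ["Q1", "Q1", "Q2", "Q2", "Q2", "Q3"]
groups_from_quarters = encode_ordered_groups(
    quarter_labels, ordered_labels=["Q1", "Q2", "Q3"])
print(groups_from_quarters)
```

Passing the order explicitly avoids relying on string sorting, which would put a label such as `Q10` before `Q2`.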

I named the method `split` to mirror scikit-learn's own splitter classes. Strictly speaking, the `cv` parameter of `GridSearchCV` accepts any iterable that yields pairs of training and test index arrays, so passing the generator returned by `split` is enough. Here, I created a range of integers (groups) to keep the order of the dates. Then I assigned the indices of the first group (t) as the training indices and those of the next group (t + 1) as the validation indices. In the end, the method yields the training and test indices for each split.
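If you would rather hand the object itself to `cv` instead of a generator, scikit-learn expects it to expose both `split` and `get_n_splits`. Below is a rough sketch of such a variant. The class name `GroupedWalkForwardSplit` is hypothetical, and unlike the class above it pairs each observed group with the next observed one, so gaps in the group ids are skipped.

```python
import numpy as np


class GroupedWalkForwardSplit:
    """Hypothetical splitter: train on one group, test on the next one."""

    def split(self, X, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter is required.")
        groups = np.asarray(groups)
        ordered = np.unique(groups)  # sorted unique group ids
        for train_group, test_group in zip(ordered[:-1], ordered[1:]):
            yield (np.where(groups == train_group)[0],
                   np.where(groups == test_group)[0])

    def get_n_splits(self, X=None, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter is required.")
        # One split per pair of consecutive groups
        return max(len(np.unique(groups)) - 1, 0)


demo_groups = np.repeat([0, 1, 2, 3], [5, 10, 10, 5])
splitter = GroupedWalkForwardSplit()
print(splitter.get_n_splits(groups=demo_groups))
```

With an object like this, the groups would be supplied through `fit(X, y, groups=...)` rather than when building the splitter.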

The example below shows how the custom split works with the groups. To have date groups of different sizes, I created 4 groups with 5 instances of 0s, 10 instances of 1s, 10 instances of 2s, and 5 instances of 3s:

```
X_experiment, y_experiment = make_regression(
    n_samples=30, n_features=5, noise=0.2)

groups_experiment = np.concatenate([np.zeros(5),  # 5 0s
                                    np.ones(10),  # 10 1s
                                    2 * np.ones(10),  # 10 2s
                                    3 * np.ones(5)  # 5 3s
                                    ]).astype(int)

for idx, (x, y) in enumerate(
        CustomCrossValidation.split(X_experiment,
                                    y_experiment,
                                    groups_experiment)):
    print(f"Split number: {idx}")
    print(f"Training indices: {x}")
    print(f"Test indices: {y}\n")
```

With the groupings, the example dataset looks like this:

```
# The first 5 predictor values...
          0         1         2         3         4
0 -0.566298  0.099651  2.190456 -0.503476 -0.990536
1  0.174578  0.257550  0.404051 -0.074446  1.886186
2  0.314247 -0.908024 -0.562288 -1.412304 -1.012831
3 -1.106335 -1.196207 -0.479174  0.812526 -0.185659
4 -0.013497 -1.057711 -0.601707  0.822545  1.852278

# The first 5 target values...
            0
0   73.398681
1  195.221637
2 -139.402678
3 -124.863423
4   94.753517

# Groupings for the example dataset...
# The 0s are older date anchor values, whereas the 3s are the newest...
[0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3]
```

The groups impose an order on the validation flow. First the 0s are used as the training set and the 1s as validation. Then the 1s are used as training and the 2s as validation, and so on. The generated indices for the example are:

```
Split number: 0
Training indices: [0 1 2 3 4]
Test indices: [ 5  6  7  8  9 10 11 12 13 14]

Split number: 1
Training indices: [ 5  6  7  8  9 10 11 12 13 14]
Test indices: [15 16 17 18 19 20 21 22 23 24]

Split number: 2
Training indices: [15 16 17 18 19 20 21 22 23 24]
Test indices: [25 26 27 28 29]
```
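Since the whole point is to keep the time order and avoid leakage, it can be worth checking the splits before using them. This is a small sketch of such a check. `is_time_ordered` is a hypothetical helper, and it assumes that larger group ids mean newer data.

```python
import numpy as np


def is_time_ordered(splits, groups):
    """Return True if every training group is older than its test group.

    Hypothetical helper; assumes larger group ids are newer.
    """
    groups = np.asarray(groups)
    for train_idx, test_idx in splits:
        if len(train_idx) == 0 or len(test_idx) == 0:
            return False  # an empty side makes the split unusable
        if groups[train_idx].max() >= groups[test_idx].min():
            return False  # training data is not strictly older
    return True


groups_check = np.repeat([0, 1, 2, 3], [5, 10, 10, 5])
# Same pattern as above: train on group t, test on group t + 1
splits_check = [
    (np.where(groups_check == t)[0], np.where(groups_check == t + 1)[0])
    for t in range(3)
]
print(is_time_ordered(splits_check, groups_check))
```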

As an example setup, I will use Lasso regression and try to optimize alpha with Grid Search. In Lasso, a larger alpha forces more coefficients to be 0. It is very common to search for the optimal value of alpha in a Lasso regression.

```
from sklearn import linear_model
from sklearn.model_selection import GridSearchCV

# Instantiating the Lasso estimator
reg_estimator = linear_model.Lasso()
# Parameters
parameters_to_search = {"alpha": [0.1, 1, 10]}
# Splitter
custom_splitter = CustomCrossValidation.split(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)

# Search setup
reg_search = GridSearchCV(
    estimator=reg_estimator,
    param_grid=parameters_to_search,
    scoring="neg_root_mean_squared_error",
    cv=custom_splitter)
# Fitting
best_model = reg_search.fit(
    X=X_experiment,
    y=y_experiment,
    groups=groups_experiment)
```

This will output the best estimator as follows, using the custom cross-validation. There will be 3 splits as we used 4 groups.

```
# Best model:
Lasso(alpha=0.1)

# Number of splits:
3
```
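One detail worth knowing: `fit` returns the search object itself, so values like these are read from its attributes, such as `best_estimator_` and `n_splits_`. The sketch below puts the pieces together in a self-contained form. The `_demo` names and `random_state=0` are my own assumptions, and the splits are stored in a list so that they can be reused, since a generator is exhausted after one pass.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Illustrative data; random_state is fixed only for repeatability
X_demo, y_demo = make_regression(
    n_samples=30, n_features=5, noise=0.2, random_state=0)
groups_demo = np.repeat([0, 1, 2, 3], [5, 10, 10, 5])

# Materialize (train, test) pairs: train on group t, test on group t + 1
splits_demo = [
    (np.where(groups_demo == t)[0], np.where(groups_demo == t + 1)[0])
    for t in range(groups_demo.min(), groups_demo.max())
]

search_demo = GridSearchCV(
    estimator=Lasso(),
    param_grid={"alpha": [0.1, 1, 10]},
    scoring="neg_root_mean_squared_error",
    cv=splits_demo)
search_demo.fit(X_demo, y_demo)

print(search_demo.best_estimator_)  # refitted estimator with the best alpha
print(search_demo.n_splits_)  # number of cross-validation splits used
print(search_demo.cv_results_["mean_test_score"])  # one score per alpha
```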

Voila, a simple generator was enough to get a custom validation flow in a Grid Search optimization. I enjoy reading the scikit-learn documentation. Besides the fact that reading is fun, it helps me understand some statistical implementations better and tweak them whenever necessary.

For a complete set of examples, please refer to the GitHub repository. Happy reading the documentation!