Automated Feature Selection with Hyperopt

Clay Elmore
Mar 14, 2021

Feature selection is a critical part of the machine learning lifecycle because it affects many aspects of an ML model, including but not limited to:

  • Training time
  • Model bias, generality, and extrapolation
  • Accuracy
  • Inference speed
  • Model size (RAM and disk)

There are many ways to select features for tabular learning algorithms, including feature importance, feature perturbation, and reduction of feature collinearity. However, as the MLOps world moves toward continuous training (CI-CD-CT), it needs more flexible, automated feature selection in order to create stable training systems that a data scientist does not have to tune constantly. These feature selection mechanisms should be grounded in mathematical rigor, much like the hyperparameter optimization methods that have taken off in autoML over the past few years. One of the most popular hyperparameter optimization packages is “hyperopt”, which uses Bayesian-based optimization algorithms to search complex spaces for optimal solutions to a given function. In the rest of this post, I will show how this package, and the optimization formulation behind it, can be extended to automate the feature selection process.

The Data

The data I used for these experiments comes from Kaggle: https://www.kaggle.com/c/tabular-playground-series-mar-2021/. The task is binary classification on a dataset that contains both categorical and continuous features. The dataset is large enough (300K rows) that real-world data phenomena show up. First things first: import the libraries.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from hyperopt import fmin, tpe, STATUS_OK, Trials, hp, space_eval

Now preprocess the data.

df = pd.read_csv('train.csv')
y_var = 'target'

# Every column except the target and the row id is a candidate feature
X_vars = [i for i in df.columns if i not in [y_var, 'id']]
cont_vars = [i for i in X_vars if 'cont' in i]
cat_vars = [i for i in X_vars if 'cat' in i]

# Encode each categorical column as integer codes
for col in cat_vars:
    le = OrdinalEncoder()
    df[col] = le.fit_transform(df[col].to_numpy().reshape(-1, 1)).ravel()
    df[col] = df[col].astype(int)

X_train, X_test, y_train, y_test = train_test_split(df[X_vars], df[y_var])
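
For intuition, the encoding step can be mimicked in a few lines of plain Python. This is only a sketch of what OrdinalEncoder does with its default settings (sorted unique categories mapped to 0, 1, 2, ...); the helper name and the sample values are made up for illustration.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    categories = sorted(set(values))
    mapping = {cat: code for code, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Hypothetical categorical column, not taken from the Kaggle data
codes, mapping = ordinal_encode(["B", "A", "C", "A"])
print(codes)    # [1, 0, 2, 0]
print(mapping)  # {'A': 0, 'B': 1, 'C': 2}
```

The integer codes impose an arbitrary order on the categories, which tree-based models such as the Random Forest used in this post generally tolerate.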

We have a training and testing set, and are ready to get to the fun part now!

Baseline Model

To make sure we are not sacrificing accuracy for the sake of removing features, I first trained a baseline model on all features to compare against. Note that I used a Random Forest because I wanted a model with a limited hyperparameter space and not too much power, which keeps the focus on the feature space. The training data is also downsampled to the first 10,000 rows to reduce compute time.

model = RandomForestClassifier()
model.fit(X_train[0:10000], y_train[0:10000])
pred_proba = model.predict_proba(X_test)
auc_base = metrics.roc_auc_score(y_test, pred_proba[:, 1])
print(f'Baseline AUC: {auc_base:.3f}')
>>> Baseline AUC: 0.866
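
For readers less familiar with the metric: ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The pure-Python sketch below spells out that definition on made-up labels and scores. The function name is hypothetical, and the post itself relies on metrics.roc_auc_score rather than this helper.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a loss of 1 - AUC is smaller for better models.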

Feature Selection

How many of the 30 features in the dataset can we get rid of without losing any accuracy? The class below wraps a hyperopt optimization over a parameter space supplied at initialization. Its objective trains the model on the features flagged 1 and returns 1 - AUC on the test set as the loss to minimize.

class HpOptBinarySelect:
    def __init__(self, X_train, X_test, y_train, y_test, space, model):
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test
        self.parameter_space = space
        self.model = model

    def objective(self, params):
        # Keep only the features whose binary flag is 1
        cols = [i for i, j in params.items() if j == 1]
        self.model.fit(self.X_train[cols], self.y_train)
        pred_proba = self.model.predict_proba(self.X_test[cols])
        # hyperopt minimizes, so use 1 - AUC as the loss
        loss = 1 - metrics.roc_auc_score(self.y_test, pred_proba[:, 1])
        return {'loss': loss, 'status': STATUS_OK}

    def optimize(self, max_evals=20):
        trials = Trials()
        best = fmin(fn=self.objective,
                    space=self.parameter_space,
                    algo=tpe.suggest,
                    max_evals=max_evals,
                    trials=trials)
        # Map hyperopt's raw result back to values from the search space
        return space_eval(self.parameter_space, best)
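
The column-decoding step inside objective can be tried in isolation. The sketch below uses a hand-written sample dictionary (hypothetical feature names and flags) in place of a real hyperopt sample. It also shows one possible way, which is my assumption and not part of the class above, to handle the rare sample that drops every feature: return the worst-case loss instead of fitting a model on zero columns.

```python
def selected_columns(params):
    """Return the feature names whose binary flag is 1."""
    return [name for name, flag in params.items() if flag == 1]


def safe_loss(cols, score_fn):
    """Return 1 - AUC, or the worst-case loss of 1.0 when no feature is kept.

    score_fn is a hypothetical callable that trains on `cols` and returns an AUC.
    """
    if not cols:
        return 1.0
    return 1.0 - score_fn(cols)


# Hypothetical sample; a real one has one key per feature in the search space
sample = {"cont0": 1, "cont1": 0, "cat0": 1, "cat1": 0}
print(selected_columns(sample))  # ['cont0', 'cat0']

# A sample that drops everything decodes to an empty list
empty = selected_columns({"cont0": 0, "cat0": 0})
print(safe_loss(empty, score_fn=lambda cols: 0.9))  # 1.0
```

With 30 binary flags an all-zero sample is very unlikely, but a guard like this keeps a long-running search from failing on it.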

Now create a binary choice variable for each feature in the dataset that will be fed into the class as the parameter search space.

space = {}
for col in cont_vars + cat_vars:
    space[col] = hp.choice(col, [0, 1])
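
To get a feel for how large this search space is, the short calculation below counts the candidate feature subsets, assuming the 30 features mentioned earlier and the 100-trial budget used later in this post. The variable names are illustrative only.

```python
# Each feature is either dropped (0) or kept (1), so n features give 2**n subsets
n_features = 30   # feature count stated in the post
max_evals = 100   # evaluation budget used in the post

n_subsets = 2 ** n_features
fraction_explored = max_evals / n_subsets

print(f"Candidate subsets: {n_subsets:,}")
print(f"Fraction explored with {max_evals} trials: {fraction_explored:.2e}")
```

Exhaustive search is clearly out of reach at this size, which is the motivation for a guided search that explores only a small sample of the space.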

And finally, optimize the feature space given the class above.

hpobj = HpOptBinarySelect(
    X_train[0:10000],
    X_test,
    y_train[0:10000],
    y_test,
    space,
    RandomForestClassifier(),
)
best = hpobj.optimize(max_evals=100)
out = [i for i, j in best.items() if j == 1]
print(f'Final number of features {len(out)}')
>>> 100%|██████████| 100/100 [02:15<00:00, 1.36s/trial, best loss: 0.13417683090083077]
>>> Final number of features 20

The feature space is now reduced by 33%. Let’s train the model with the suggested features and make sure we haven’t lost any accuracy.

model = RandomForestClassifier()
model.fit(X_train[0:10000][out], y_train[0:10000])
pred_proba = model.predict_proba(X_test[out])
auc = metrics.roc_auc_score(y_test, pred_proba[:, 1])
print(f'Hyperopt AUC: {auc:.3f}')
>>> Hyperopt AUC: 0.866

The final AUC has not changed at all, even though the model is seeing 10 fewer features than the baseline!
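
In a continuous training setting, this comparison would typically be automated instead of checked by eye. Below is a minimal sketch of such a check, using the feature counts and AUC values reported in this post. The function name and the tolerance of 0.001 are my own assumptions, not something the original code defines.

```python
def selection_report(n_before, n_after, auc_before, auc_after, tol=0.001):
    """Summarize a feature-selection run and flag whether AUC stayed within tol."""
    reduction = 1 - n_after / n_before
    auc_drop = auc_before - auc_after
    return {
        "reduction": reduction,
        "auc_drop": auc_drop,
        "accepted": auc_drop <= tol,  # accept only if AUC did not drop by more than tol
    }


# Numbers reported above: 30 -> 20 features, AUC 0.866 before and after
report = selection_report(30, 20, 0.866, 0.866)
print(f"Feature reduction: {report['reduction']:.0%}")
print(f"Accepted: {report['accepted']}")
```

Ideally the two AUC values would come from a held-out set that was not used during the search, so that the check is not biased toward the selected features.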

Conclusions

  • Hyperopt can be formulated to search for optimal feature sets over an arbitrary space of candidate features
  • Feature selection via mathematical principles is a great tool for auto-ML and continuous training (CT)
  • Feature selection via Hyperopt is easier to automate than traditional rule-based feature selection techniques

Future work

In my day-to-day MLOps work, I use hyperopt A LOT for continuous training. One of my goals is to develop a robust code base that automatically searches arbitrary spaces for optimal hyperparameter, preprocessing, and feature set combinations and outputs the best model for a given problem. Hope I get to it soon :)

Hope you enjoyed the post and found something in here useful!

P.S: I used this technique on the same dataset with XGBoost (out-of-the-box hyperparameters) as well and found similar results (reducing the feature set by ~30% and keeping AUC at a baseline level of 0.887).
