Categorical and Numeric Data in Scikit-Learn Pipelines

I like to organize every part of my experiments with tools as handy as Pipeline. However, continuous variables should not go through a OneHotEncoder, and categorical variables should not go through a scaler. The solution is to split your columns, process each group in its own pipeline, and then merge the results again. Inspired by the scikit-learn examples.

Identifying the Columns / Features

The first step is to identify which columns are categorical and which are numeric.

numeric_features = ['salary', 'zone_count', 'staff_count']
categorical_features = ['rank', 'district']

I found a super cool way to achieve this automatically (I forgot the source, will mention it once I find it).

categorical_feature_mask = df.dtypes==object
categorical_features = df.columns[categorical_feature_mask].tolist()

numeric_feature_mask = df.dtypes!=object
numeric_features = df.columns[numeric_feature_mask].tolist()

This works on the assumption that categorical features are never stored as numbers. However, numbers are often categorical features (numeric IDs or codes, for instance)! Be careful when using this neat trick, and check whether all your apparently numeric columns really are numeric.
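Here is a minimal sketch of one way to guard against that pitfall. The toy DataFrame is made up for illustration, and I am assuming that district is stored as a numeric code even though it is really a category. The idea is to mark such columns as categorical first and only then split by dtype with select_dtypes.

```python
import pandas as pd

# Hypothetical toy data; the column names follow the lists above.
df = pd.DataFrame({
    "salary": [52000.0, 61000.0, 58000.0],
    "zone_count": [3, 5, 4],
    "staff_count": [12, 30, 21],
    "rank": ["junior", "senior", "mid"],
    "district": [101, 102, 101],  # stored as numbers, but really a category
})

# Columns we know are categorical even though they hold numbers (an assumption
# about this toy data, not something pandas can detect on its own).
numeric_coded_categoricals = ["district"]
for column in numeric_coded_categoricals:
    df[column] = df[column].astype("category")

# Everything with a numeric dtype is numeric; everything else is categorical.
numeric_features = df.select_dtypes(include="number").columns.tolist()
categorical_features = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_features)
print(categorical_features)
```

The manual list is the important part: no dtype check can tell you that a column of integers is actually a set of labels, so that decision still needs a human.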

Setting up Numeric and Categorical Pipelines

First, I am setting up my pipeline for the categorical data I have. (Yeah with one step!)

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
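To see what handle_unknown='ignore' buys us, here is a small sketch on made-up data: the encoder is fitted on three ranks and then asked to transform a rank it has never seen. The "director" value is a hypothetical example of such an unseen category.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# Hypothetical training data and new data containing an unseen category.
train = pd.DataFrame({"rank": ["junior", "senior", "mid"]})
new = pd.DataFrame({"rank": ["senior", "director"]})

categorical_transformer.fit(train)

# The encoder returns a sparse result by default, so densify it for printing.
encoded = categorical_transformer.transform(new).toarray()
print(encoded)
# Learned categories are sorted: junior, mid, senior.
# "senior" becomes [0, 0, 1]; the unseen "director" becomes all zeros.
```

Without handle_unknown='ignore', the same transform call would raise an error on the unseen category, which is rarely what you want at prediction time.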

Next up, let's set up the pipeline for our numeric values.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
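A quick sanity check of the numeric pipeline on made-up values, as a sketch: missing entries are first filled with the column median, then each column is standardized to zero mean and unit variance. The numbers below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Hypothetical data with one missing value per column.
X = pd.DataFrame({
    "salary": [50000.0, np.nan, 70000.0],
    "staff_count": [10.0, 20.0, np.nan],
})

scaled = numeric_transformer.fit_transform(X)
print(scaled)

# Each column now has mean 0 (up to floating-point error).
print(scaled.mean(axis=0))
```

The order of the steps matters: imputing before scaling means the scaler never sees a NaN, and the statistics it learns already include the filled-in values.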

And we are done with this phase. Let's move on and combine these two pipelines!

The Great Join

We'll be using a ColumnTransformer for this bit. Let's combine these transformation pipelines.

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

Awesome! The ColumnTransformer now applies each pipeline to its own set of columns.
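To check that the join does what we expect, here is a self-contained sketch that runs the preprocessor on a tiny made-up DataFrame and looks at the output shape. The data values and the extra "notes" column are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['salary', 'zone_count', 'staff_count']
categorical_features = ['rank', 'district']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Hypothetical toy data: 3 numeric columns, 3 ranks, 2 districts, 1 extra column.
df = pd.DataFrame({
    'salary': [52000.0, None, 58000.0, 75000.0],
    'zone_count': [3, 5, 4, 6],
    'staff_count': [12, 30, 21, 40],
    'rank': ['junior', 'senior', 'mid', 'senior'],
    'district': ['north', 'south', 'north', 'south'],
    'notes': ['a', 'b', 'c', 'd'],  # not listed in any transformer
})

transformed = preprocessor.fit_transform(df)
# Depending on sparsity, the result may come back sparse; densify if so.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# 3 scaled numeric columns + 3 one-hot rank columns + 2 one-hot district columns.
print(transformed.shape)
```

Note that the "notes" column does not appear in the output: by default, ColumnTransformer drops any column that is not assigned to a transformer (remainder='drop'). Pass remainder='passthrough' if you want to keep such columns untouched.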

Constructing the Main Pipe

Now, the part you've been waiting for: actually feeding everything we've built into a classifier, a regressor, or something else.

from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', RandomForestClassifier())
])

Now use the pipeline as usual: fit your data, predict on test data, run benchmarks, and maybe deploy!
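Putting it all together, here is an end-to-end sketch with a RandomForestClassifier as the final step. The data and the "promoted" target column are made up purely for illustration, and random_state is fixed only to make the run repeatable.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['salary', 'zone_count', 'staff_count']
categorical_features = ['rank', 'district']

# Hypothetical toy dataset; "promoted" is an invented binary target.
df = pd.DataFrame({
    'salary': [52000, 61000, None, 75000, 48000, 83000,
               57000, 69000, 45000, 91000, 63000, 55000],
    'zone_count': [3, 5, 4, 6, 2, 7, 3, 5, 2, 8, 4, 3],
    'staff_count': [12, 30, 21, 40, 8, 55, 15, 33, 9, 60, 25, 14],
    'rank': ['junior', 'senior', 'mid', 'senior', 'junior', 'senior',
             'mid', 'senior', 'junior', 'senior', 'mid', 'junior'],
    'district': ['north', 'south', 'north', 'east', 'south', 'east',
                 'north', 'south', 'east', 'north', 'south', 'east'],
    'promoted': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0],
})
X = df[numeric_features + categorical_features]
y = df['promoted']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
    ]), numeric_features),
    ('cat', Pipeline(steps=[
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), categorical_features),
])

clf = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', RandomForestClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fitting the pipeline fits the preprocessing on the training split only,
# so nothing from the test split leaks into the imputer or scaler statistics.
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(predictions, accuracy)
```

Keeping the preprocessing inside the pipeline is the whole point: the same object can be passed to cross_val_score or GridSearchCV, and each fold gets its own freshly fitted imputer, scaler, and encoder.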

Thanks for reading through! If you liked this simplified and broken down explanation, please do not forget to share it with your friends. Why not leave your thoughts in the comments below?