I always tend to organize every aspect of my experiments with organizers as useful as Pipeline. However, one shouldn't pass continuous variables into a OneHotEncoder, or categorical ones into a scaler. The solution: split your columns, run them through separate pipelines, then merge them back together. Inspired by the Scikit-Learn examples.
Identifying the Columns / Features
The first step is to identify which columns are categorical and which are numeric.
numeric_features = ['salary', 'zone_count', 'staff_count']
categorical_features = ['rank', 'district']
I found a super cool way to achieve this automatically (I forgot the source, will mention it once I find it).
categorical_feature_mask = df.dtypes==object
categorical_features = df.columns[categorical_feature_mask].tolist()
numeric_feature_mask = df.dtypes!=object
numeric_features = df.columns[numeric_feature_mask].tolist()
This again assumes that categorical features are not represented by numbers. However, numbers can often encode categories! Be careful when using this neat trick, and consider whether all of your apparently numeric features are numeric after all.
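For instance, if a column holds integer codes that really represent categories, you can cast it before building the masks so the dtype trick picks it up. A minimal sketch with made-up data (the column names here are just illustrative):

```python
import pandas as pd

# Hypothetical frame: 'district' is stored as an integer code,
# but it is really a categorical feature.
df = pd.DataFrame({
    'salary': [50000.0, 62000.0, 48000.0],
    'district': [1, 2, 1],
})

# Cast the coded column to object so the dtype mask treats it as categorical
df['district'] = df['district'].astype(object)

categorical_feature_mask = df.dtypes == object
categorical_features = df.columns[categorical_feature_mask].tolist()
# 'district' now lands in categorical_features, 'salary' stays numeric
```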
Setting up Numeric and Categorical Pipelines
First, I am setting up the pipeline for the categorical data. (Yes, with a single step!)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
Next up, let's set up the pipeline for our numeric values.
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
And we are done with this phase. Let's move on and combine these two pipelines!
The Great Join
We'll be using a ColumnTransformer for this bit. Let's combine these transformation pipelines.
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])
Awesome! And we have the separated pipelines being applied through a single ColumnTransformer.
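As a quick sanity check, the combined preprocessor can be fitted and applied on its own. This sketch uses made-up data and column names, but the structure mirrors the transformers defined above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Tiny illustrative frame with a missing numeric value
df = pd.DataFrame({
    'salary': [50000.0, 62000.0, None],
    'rank': ['junior', 'senior', 'junior'],
})

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['salary']),
    ('cat', categorical_transformer, ['rank'])
])

# 3 rows out: 1 scaled numeric column + 2 one-hot columns
X = preprocessor.fit_transform(df)
```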
Constructing the Main Pipe
Now, the part you've been waiting for, actually using everything we've done to feed into a Classifier, Regressor or something else.
from sklearn.ensemble import RandomForestClassifier
clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
Now use the pipeline as usual, fit your data, predict on test data, do benchmarks, and maybe deploy!
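That end-to-end flow can be sketched as follows. The data, column names, and target are all made up for illustration; only the pipeline structure follows the post:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier

# Made-up data with one numeric and one categorical feature
df = pd.DataFrame({
    'salary': [50, 60, 55, 70, 65, 52, 58, 61],
    'rank': ['a', 'b', 'a', 'b', 'b', 'a', 'a', 'b'],
    'target': [0, 1, 0, 1, 1, 0, 0, 1],
})
X, y = df[['salary', 'rank']], df['target']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, ['salary']),
    ('cat', categorical_transformer, ['rank'])
])

clf = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=0))
])

# Fit on the training split, then evaluate on the held-out test split
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
```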
Thanks for reading through! If you liked this simplified and broken down explanation, please do not forget to share it with your friends. Why not leave your thoughts in the comments below?