A Beginner’s Guide to Feature Engineering, Part 1

Edna Figueira Fernandes
Jan 31, 2020

What is feature engineering?

Feature engineering is the process of using domain knowledge of the data to add, remove, or combine features with the goal of improving the performance of a machine learning algorithm.

Why do we want to do feature engineering?

  1. We have background knowledge on the subject and know that adding specific features will help the model.
  2. We do not have background knowledge but through exploratory data analysis (EDA) we gain insights and decide to test new features. For example, while plotting histograms we may notice that a feature has several humps and we may decide to break these into bins.
(Figure: histogram showing two humps)
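The binning idea above can be sketched with pandas. This is a minimal illustration on synthetic data; the feature name, the two humps, and the cut point at 40 are all hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical bimodal feature: one hump around 20, another around 60
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(20, 5, 500), rng.normal(60, 5, 500)])
df = pd.DataFrame({"age": values})

# Break the feature into bins so each hump lands in its own group
df["age_bin"] = pd.cut(df["age"], bins=[-np.inf, 40, np.inf], labels=["low", "high"])
print(df["age_bin"].value_counts())
```

The model can then treat each bin as its own category instead of fitting a single trend across both humps.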

When does feature engineering happen?

  1. During the modeling process. Classifiers such as an SVM with a polynomial or Gaussian kernel are structured so that they capture interactions between features, so we do not need to create those interaction features manually.
  2. Before modeling. We may decide to drop features that we believe are not useful. We can create a new feature, such as the ratio between two existing features; combine features that we believe to be highly correlated; or get a feature from an external database, transform it into a shape adequate for the model, and then add it to our dataset (ETL: extract, transform, load).
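Creating a ratio feature before modeling can be sketched in a few lines of pandas. The column names and values here are made up for illustration:

```python
import pandas as pd

# Hypothetical housing data with two related features
df = pd.DataFrame({
    "total_rooms": [6, 8, 4, 10],
    "households": [2, 2, 1, 4],
})

# New feature: the ratio of the two existing columns
df["rooms_per_household"] = df["total_rooms"] / df["households"]
print(df)
```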

If we take the second approach, we need to be careful not to leak information about the relationship between the features and the target variable, because that can lead to overfitting. We also need to be careful not to introduce relationships that were not present in the raw dataset; this, too, can lead to overfitting!
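One common way to avoid this kind of leakage is to fit any feature transformation on the training split only, then apply it unchanged to the test split. A minimal sketch with scikit-learn, assuming a standard-scaling step and synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrix and target (illustrative only)
X = np.random.default_rng(1).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics; never refit on test
```

Fitting the scaler on the full dataset would let test-set statistics influence the training features, which is exactly the kind of leakage to avoid.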

How do we know whether the features we add bring value to our learning algorithm?

We need a training set and a test set. If, after adding a new feature to the training set, the metrics on the test set do not improve, or even get worse, we are most likely overfitting.
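This check can be sketched with scikit-learn on synthetic data: train the same model with and without an engineered feature and compare held-out accuracy. Here the target depends on the interaction `a * b`, a relationship the raw features alone do not expose; everything in this example is illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data where the label depends on the interaction a * b
rng = np.random.default_rng(2)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
y = (a * b > 0).astype(int)

results = {}
for name, X in [("base", np.column_stack([a, b])),
                ("with interaction", np.column_stack([a, b, a * b]))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, model.predict(X_te))  # metric on held-out data

print(results)
```

If the engineered feature genuinely helps, the held-out metric improves; if it stays flat or drops, the feature is adding noise rather than signal.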

What types of feature engineering can we do to our data?

  1. Feature selection and data reduction
  2. Feature scaling
  3. Discretization
  4. Categorical Encoding
  5. Feature construction
  6. Target manipulations

These will be discussed in more detail in my upcoming blog posts!
