
Feature Engineering

Estimated time to read: 5 minutes

Feature engineering transforms raw data into a structured format that machine learning algorithms can better understand and process. In other words, it involves creating and selecting the most relevant features (also called variables or attributes) from the raw data to improve the performance of machine learning models.

Feature engineering aims to extract valuable information from the data, reduce noise, and represent it in a way that makes it easier for algorithms to learn from. It plays a crucial role in building an effective machine learning model, as the quality of the features directly impacts the model's performance.

Feature engineering can involve several techniques, such as:

Feature extraction

Extracting new, meaningful features from the raw data. For example, creating interaction terms, polynomial features, or decomposing timestamps into separate components like day of week, month, or year.

Interaction Term Synthesis: Creating new features by multiplying or dividing existing variables. This captures relationships between features that are not apparent when each is analysed individually.

Polynomial Expansion: Generating new features by raising existing variables to a power. This helps capture non-linear relationships that standard linear models would otherwise miss.

Temporal Deconstruction: Breaking down high-resolution timestamps into separate components (e.g., day of week, month, peak/off-peak). This reveals vital patterns and seasonality within the data.
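The three extraction techniques above can be sketched with pandas; the event log, its column names, and the peak-hour windows here are illustrative assumptions, not part of any particular dataset:

```python
import pandas as pd

# Hypothetical event log with a raw timestamp column.
df = pd.DataFrame({"timestamp": pd.to_datetime([
    "2024-01-05 08:30", "2024-01-06 22:15", "2024-07-15 12:00",
])})

# Temporal deconstruction: split the timestamp into components.
df["day_of_week"] = df["timestamp"].dt.dayofweek   # Monday = 0
df["month"] = df["timestamp"].dt.month
hour = df["timestamp"].dt.hour
df["is_peak"] = hour.between(7, 9) | hour.between(16, 18)

# Interaction term: combine two existing features into one.
df["dow_hour"] = df["day_of_week"] * hour
```

`PolynomialFeatures` from scikit-learn can generate the polynomial expansions mentioned above in the same spirit.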

Feature scaling

Scaling features to a common scale, so that their values can be compared fairly. Common scaling techniques include min-max scaling, standardisation (z-score), and normalisation.

Min-Max Range Scaling: Scaling features to a fixed range, typically [0, 1]. This ensures all features contribute equally to the model, although it remains sensitive to extreme outliers.

Z-Score Standardisation: Scaling features by subtracting the mean and dividing by the standard deviation. This creates a distribution with zero mean and unit variance; it is less sensitive to extreme outliers than min-max scaling, though not immune to them.

L2-Norm Normalisation: Scaling each sample so that its feature vector has unit Euclidean length. This is useful when the direction of the vector matters more than its magnitude, as in text or similarity-based models.
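All three scaling techniques are available in scikit-learn's preprocessing module; a minimal sketch on a toy matrix (the values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

mm = MinMaxScaler().fit_transform(X)            # each column mapped into [0, 1]
z = StandardScaler().fit_transform(X)           # zero mean, unit variance per column
unit = Normalizer(norm="l2").fit_transform(X)   # each row scaled to unit Euclidean length
```

Note that `MinMaxScaler` and `StandardScaler` operate per column (per feature), while `Normalizer` operates per row (per sample).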

Feature selection

Identifying the most important features for the model by either eliminating irrelevant or redundant features or selecting a subset of features that contribute the most to the model's performance. Feature selection techniques include filter, wrapper, and embedded methods.

Statistical Filter Methods: Selecting features based on their relationship with the target variable using metrics like correlation or mutual information. This is computationally efficient for high-dimensional datasets.

Algorithmic Wrapper Methods: Evaluating feature subsets by training specific models (e.g., Recursive Feature Elimination). While computationally expensive, these methods often yield the highest predictive performance.

Integrated Embedded Methods: Selecting features during the model training process itself (e.g., LASSO regularisation or Random Forest feature importance), ensuring that only the most relevant signal is preserved.
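A filter method and a wrapper method can be sketched side by side with scikit-learn; the synthetic dataset and the choice of three selected features are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of which carry signal.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Filter method: keep the 3 features with the highest mutual information.
X_filter = SelectKBest(mutual_info_classif, k=3).fit_transform(X, y)

# Wrapper method: recursive feature elimination around a logistic model.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
X_wrapper = X[:, rfe.support_]
```

An embedded method would instead read the selection off the fitted model itself, e.g. the non-zero coefficients of a LASSO fit.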

Feature transformation

Applying mathematical transformations to the features to achieve a more desirable distribution or relationship with the target variable. Examples include log transformation, square root transformation, and power transformation.

Logarithmic Variance Stabilisation: Applying a logarithm to features to reduce the impact of outliers and stabilise variance across skewed distributions. Note that this requires positive inputs; log(1 + x) is a common variant for non-negative data.

Square Root Compression: Transforming features by taking the square root to capture non-linear relationships and dampen the effect of high-value outliers.

Parametric Power Transformations: Using techniques like Box-Cox or Yeo-Johnson to stabilise variance and bring the data distribution closer to normality.
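The three transformations above, sketched on a heavily right-skewed toy vector (the values are arbitrary):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

x = np.array([1.0, 10.0, 100.0, 1000.0])  # right-skewed: spans three orders of magnitude

log_x = np.log1p(x)     # log(1 + x): strong compression of the right tail
sqrt_x = np.sqrt(x)     # square root: milder compression

# Yeo-Johnson fits the power parameter from the data and, by default,
# standardises the result to zero mean and unit variance.
yj = PowerTransformer(method="yeo-johnson").fit_transform(x.reshape(-1, 1))
```

Box-Cox (`method="box-cox"`) is similar but only accepts strictly positive inputs, whereas Yeo-Johnson also handles zeros and negative values.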

Handling missing values

Filling in missing values in the data using various strategies like imputation, deletion, or interpolation.

Statistical Value Imputation: Filling missing data with statistical measures like the mean, median, or mode. This preserves the dataset size, though it can distort the variance and weaken correlations between features.

Listwise Case Deletion: Removing entire rows containing missing values. This is only recommended when data loss is minimal and missingness is completely at random.

Algorithmic Value Interpolation: Estimating missing values from neighbouring observations (e.g., linear interpolation), which is particularly important for preserving the integrity of time-series data.
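All three strategies map directly onto pandas one-liners; a minimal sketch on a toy series with two gaps:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

mean_filled = s.fillna(s.mean())   # statistical imputation with the column mean
dropped = s.dropna()               # listwise deletion of rows with missing values
interpolated = s.interpolate()     # linear interpolation between neighbouring values
```

For ordered data such as time series, the interpolated result recovers the underlying trend exactly here, while mean imputation flattens both gaps to the same value.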

Handling categorical variables

Converting categorical variables into numerical values using one-hot, label, or target encoding techniques.

One-Hot Sparse Encoding: Creating binary flags for each category in a variable. This is optimal for distance-based models and low-cardinality categories.

Ordinal Label Encoding: Assigning unique integers to categories. While efficient for tree-based models, it can introduce an artificial hierarchy that requires careful monitoring.

Bayesian Target Encoding: Replacing categories with the mean of the target variable. This is powerful for high-cardinality features but requires strict regularisation to avoid data leakage.
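The three encodings can be sketched with pandas; the `city`/`price` frame is a made-up illustration, and the target encoding shown is the naive in-sample version, which leaks the target unless it is computed within cross-validation folds:

```python
import pandas as pd

df = pd.DataFrame({"city": ["A", "B", "A", "C"],
                   "price": [10, 20, 30, 40]})

# One-hot: one binary column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Label encoding: one integer per category (alphabetical order here).
labels = df["city"].astype("category").cat.codes

# Naive target encoding: replace each category with its mean target value.
# WARNING: computed on the full data, this leaks the target into the feature.
target_means = df["city"].map(df.groupby("city")["price"].mean())
```

Production implementations (e.g. with smoothing towards the global mean, or out-of-fold estimates) add the regularisation mentioned above.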

Feature engineering for specific algorithms

Adapting features to the requirements of specific machine learning algorithms, such as creating polynomial features for linear regression or distance-based features for clustering algorithms.

Linear Model Augmentation: Creating interaction terms or polynomial expansions to capture complex signal that a base linear model would fail to isolate.

Distance-Based Metric Refinement: Engineering features (e.g., Euclidean or Cosine distance) to improve the performance and separation accuracy of clustering algorithms.
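Both adaptations above can be sketched briefly; the tiny matrix and the choice of the column-wise mean as a reference point are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Augmentation for a linear model: degree-2 expansion of [a, b]
# yields [1, a, b, a^2, a*b, b^2] per row.
poly = PolynomialFeatures(degree=2).fit_transform(X)

# Distance feature for clustering: Euclidean distance of each sample
# to a reference point (here, the column-wise mean).
centre = X.mean(axis=0)
dist = np.linalg.norm(X - centre, axis=1)
```
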

Domain-specific feature engineering

In some cases, domain knowledge can be used to create more relevant features that better capture the underlying patterns in the data. For example:

NLP Semantic Extraction: Utilising tokenisation, lemmatisation, and n-grams to transform raw text into high-dimensional vectors (e.g., TF-IDF or BERT embeddings).

Computer Vision Pre-processing: Employing techniques like histogram equalisation, cropping, and CNN-based feature extraction to isolate salient visual patterns.

Time Series Signal Engineering: Creating lagged variables and rolling windows to reveal underlying trends, signals, and temporal seasonality.
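The time-series example is the easiest to sketch; the sales series and the window length of three are illustrative:

```python
import pandas as pd

sales = pd.Series([10, 12, 13, 15, 16], name="sales")

lag_1 = sales.shift(1)                   # previous observation as a feature
roll_3 = sales.rolling(window=3).mean()  # 3-step rolling mean smooths short-term noise
```

Both features are undefined for the earliest observations (NaN), so the first rows are typically dropped or imputed before model training.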

By carefully engineering features, data scientists can improve the performance and interpretability of machine learning models and ultimately derive more value from the data.