
I was working on a project to find an efficient, automated way to detect sensor failures on a rocket engine. Machine learning has become a popular solution to many hard problems, and I wanted to apply it here. After some trial and error, I was excited to report to a small team of rocket scientists on a pattern my ML models had detected. After a short presentation, one engineer looked at me and said, “Congratulations, you re-discovered the laws of physics.” My model had detected a strong correlation between temperature and pressure sensors. That relationship did not need a sophisticated machine learning model to uncover; a college freshman in physics or engineering would have known it. But this is what happens when you look at data streams independent of context and without spending enough time understanding the domain. Machine learning is often used as a solution looking for a problem, and that is why I developed this template: to give any machine learning project a structure that avoids these pitfalls.

Project Phases

A machine learning project can be divided into three major phases, with sub-groupings for the major activities:

  1. Formulation
    • Problem Definition
    • Data Analysis
  2. Research & Development
    • Feature Engineering
    • Model Training and Evaluation
  3. Infusion
    • Productionisation Plan
    • Continuous Improvement

This should not be understood to mean a waterfall approach. For example, data analysis could lead to refining the problem definition; or, some data analysis could be performed during the problem definition phase to better understand the domain. Depending on how large or complex the project is, you might not need to address all the items listed below, but I highly encourage you to consider them to at least make a conscious decision as to why a certain step is inapplicable.

Problem Definition

What are we trying to do, why does it matter, and is this a machine learning problem?

The Business Case

  • Who are the stakeholders? (decision-makers and end-users)
  • How does the current solution work?
  • How does the current solution perform?
  • What is the expected impact?
  • What are the success criteria?

The Machine Learning Case

  • Is this a machine learning problem?
  • What data is available?
    • Inputs and expected outputs
    • Data size, format, accessibility, metadata/labels
    • Is this sufficient (quality, representation, amount)?
  • What type of ML will we consider?
    • advanced statistics and data science
    • supervised (classification, regression, etc.)
    • unsupervised (clustering, graph structure, etc.)
    • reinforcement learning
  • Is it batch or online learning?
    • Is there a latency requirement on prediction/inference? (see the latency sketch after this list)
  • What metrics will be used to evaluate the algorithm performance?
  • What is our baseline?
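
As the latency sketch referenced in the list above: a minimal way to get a first-order answer is to time single-sample predictions and look at the median and tail. The model and data below are hypothetical stand-ins for the real candidate.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data and model; swap in the real candidate.
X = np.random.rand(10_000, 20)
y = np.random.rand(10_000)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Time single-sample inference, which is what an online system would see.
latencies = []
for row in X[:200]:
    start = time.perf_counter()
    model.predict(row.reshape(1, -1))
    latencies.append(time.perf_counter() - start)

print(f"median latency: {np.median(latencies) * 1e3:.2f} ms")
print(f"p95 latency:    {np.percentile(latencies, 95) * 1e3:.2f} ms")
```

If the tail latency already exceeds the requirement on a workstation, that constraint should shape the choice of model and serving design early.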

Data Analysis

What has been done in industry/academia, what does the data say, and what is needed to automate the pipeline?

Industry and Literature Review

  • What is the state-of-the-art solution in the industry – if any?
  • What approaches have been tried in the academic literature and to what conclusion?

Assessment of Training Data

  • Representative enough?
    • Generalize well to new unseen data
    • Accidental correlations, sampling bias, non-response bias, etc.
  • Is quality high enough?
    • Errors, missing values, outliers, and noise
    • Heterogeneity of data distribution
  • Do we have labels?
    • Could we generate labels from a simulation?
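
A minimal sketch of these checks, assuming the training data fits in a pandas DataFrame loaded from a hypothetical training_data.csv with a hypothetical label column:

```python
import pandas as pd

# Hypothetical file name; replace with the real training set.
df = pd.read_csv("training_data.csv")

# Quality: missing values and duplicate rows.
print(df.isna().mean().sort_values(ascending=False).head(10))  # fraction missing per column
print(f"duplicate rows: {df.duplicated().sum()}")

# Representativeness: does the label distribution look plausible for deployment?
if "label" in df.columns:
    print(df["label"].value_counts(normalize=True))

# Outliers and noise: quick summary statistics for numeric columns.
print(df.describe().T[["mean", "std", "min", "max"]])
```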

Exploratory Data Analysis

  • Correlation analysis
  • Sensitivity analysis
  • Conditional independence analysis
  • Spectral analysis
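
As a minimal sketch of the correlation piece (assuming a numeric pandas DataFrame named df and an illustrative |r| > 0.9 threshold), this is the kind of check that would have surfaced the temperature/pressure relationship from the opening anecdote before any model was trained:

```python
import numpy as np
import pandas as pd

# `df` is assumed to be the training DataFrame from the previous step.
corr = df.corr(numeric_only=True)

# Keep only the upper triangle so each feature pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = (
    upper.stack()
    .loc[lambda s: s.abs() > 0.9]
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(strong_pairs)  # near-duplicate sensors and known physical relationships show up here
```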

Feature Engineering

  • Feature selection: remove irrelevant features and keep the most useful among existing features
  • Feature extraction: combining existing features to get more useful features
    • Dimensionality reduction helps discover latent variables
    • Frequency domain
  • Feature generation: find new features by gathering more data
    • Human labels or data from engineering analysis
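
A minimal sketch of the selection and extraction steps with scikit-learn; the variance threshold, k, and component count are illustrative assumptions (k=20 assumes more than 20 raw features), and X_train/y_train are assumed to come out of the data analysis phase:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_pipeline = Pipeline([
    # Feature selection: drop near-constant columns, then keep the most informative ones.
    ("drop_constant", VarianceThreshold(threshold=1e-4)),
    ("select", SelectKBest(score_func=f_classif, k=20)),
    # Feature extraction: scale and project to a lower dimension to expose latent structure.
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
])

# For time-series sensor data, frequency-domain features (e.g. magnitudes from
# np.fft.rfft over a sliding window) can be appended before this pipeline runs.
X_features = feature_pipeline.fit_transform(X_train, y_train)
```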

Model Training and Evaluation

Establish a baseline benchmark, select learning algorithms, and evaluate performance.

Do you have an existing baseline you are trying to meet or exceed using a different approach (e.g. red lines, confidence intervals, prediction error, or other measures)? If not, multiple analytical approaches will be needed in order to establish observed-vs-expected comparative data points.
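
When there is no existing baseline, a trivial predictor gives a floor that any real model must beat. A minimal sketch using scikit-learn's dummy estimators, assuming a regression problem (the classification analogue is DummyClassifier); X and y are the prepared features and target:

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X, y are assumed to come out of the feature engineering step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: always predict the training-set mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
print(f"baseline MSE: {baseline_mse:.4f}")  # any candidate model should beat this
```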

Algorithm Selection

  • Supervised Learning
    • Regression
    • Classification
    • Recommender
    • Search and Rank
    • Tag
  • Unsupervised
    • Clustering (k-means, Gaussian mixtures, etc.)
    • Directed graphical models and causality
    • Generative adversarial networks
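
A minimal sketch of screening a few supervised candidates on the same cross-validation folds, assuming a binary classification task; the specific estimators and the ROC-AUC scoring are illustrative choices, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm": SVC(),
}

# X_train, y_train are assumed to come out of the feature engineering step.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name:>20}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```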

Evaluation Metrics

  • Classification
    • Precision-Recall
    • ROC-AUC
    • Accuracy
    • Log-Loss
  • Regression
    • MSE
    • R-squared (R²)
  • Unsupervised
    • Mutual Information
    • Log-likelihood
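
A minimal sketch of computing several of these with scikit-learn; y_true/y_prob (binary labels and predicted probabilities) and y_reg_true/y_reg_pred (regression targets and predictions) are placeholder names for your models' outputs:

```python
from sklearn.metrics import (
    accuracy_score,
    log_loss,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)

# Classification: y_true holds 0/1 labels, y_prob holds predicted probabilities for class 1.
y_pred = (y_prob >= 0.5).astype(int)  # the 0.5 threshold is an illustrative default
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_prob))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("log-loss: ", log_loss(y_true, y_prob))

# Regression: continuous targets and predictions.
print("MSE:      ", mean_squared_error(y_reg_true, y_reg_pred))
print("R-squared:", r2_score(y_reg_true, y_reg_pred))
```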

Algorithm Optimization

  • Loss function choices
  • Training error and underfitting
  • Test error and overfitting
  • K-fold cross-validation
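
A minimal sketch of using K-fold cross-validation to expose underfitting and overfitting: keep the training scores so the train/validation gap is visible. The estimator here is an illustrative assumption:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

model = GradientBoostingRegressor(random_state=0)

# 5-fold CV; return_train_score=True makes the train/validation gap visible.
cv_results = cross_validate(
    model,
    X_train,
    y_train,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

train_mse = -cv_results["train_score"].mean()
valid_mse = -cv_results["test_score"].mean()
print(f"train MSE: {train_mse:.4f}")
print(f"valid MSE: {valid_mse:.4f}")
# Both errors high: underfitting. Validation much worse than train: overfitting.
```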

Productionisation Plan

How will we deploy this to production, and how will we maintain and monitor its performance? Who will implement the production system, and what is the relationship between the research scientists and the engineering teams?

  • Training infrastructure and pipeline
  • Inference system and infrastructure
  • Monitoring infrastructure
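
As a minimal sketch of the monitoring piece (the statistical test, threshold, and variable names are assumptions), compare the distribution of live predictions against a snapshot stored at training time and flag drift:

```python
import numpy as np
from scipy.stats import ks_2samp


def check_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when a two-sample KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(reference, live)
    return result.pvalue < alpha


# training_predictions would be snapshotted at deployment time; recent_predictions
# would come from production inference logs. Both names are hypothetical.
if check_drift(training_predictions, recent_predictions):
    print("Prediction distribution has drifted; trigger review and possible retraining.")
```

The same check can be run per input feature to catch upstream data changes before they degrade the model.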

Continuous Improvement

How will feedback be collected and incorporated to continuously improve the model and keep up with evolving customer needs?
