
I was working on a project to find an efficient, automated way to detect sensor failures on a rocket engine. Machine learning has become a popular solution to many hard problems, and I wanted to apply it here. After some trial and error, I was excited to report to a small team of rocket scientists on a pattern my ML models had detected. After a short presentation, one engineer looked at me and said, “Congratulations, you re-discovered the laws of physics.” My model had detected a strong correlation between temperature and pressure sensors. That relationship did not need a sophisticated machine learning model to uncover; a college freshman in physics or engineering would have known it. But this is what happens when you look at data streams independent of context and without spending enough time understanding the domain. Machine learning is often used as a solution looking for a problem, and that is why I developed this template: to give any machine learning project a structure that avoids these pitfalls.

Project Phases

A machine learning project can be divided into three major phases, with sub-groupings for the major activities:

  1. Formulation
    • Problem Definition
    • Data Analysis
  2. Research & Development
    • Feature Engineering
    • Model Training and Evaluation
  3. Infusion
    • Productionisation Plan
    • Continuous Improvement

This should not be understood to mean a waterfall approach. For example, data analysis could lead to refining the problem definition; or, some data analysis could be performed during the problem definition phase to better understand the domain. Depending on how large or complex the project is, you might not need to address all the items listed below, but I highly encourage you to consider them to at least make a conscious decision as to why a certain step is inapplicable.

Problem Definition

What are we trying to do, why does it matter, and is this a machine learning problem?

The Business Case

  • Who are the stakeholders? (decision-makers and end-users)
  • How does the current solution work?
  • How does the current solution perform?
  • What is the expected impact?
  • What are the success criteria?

The Machine Learning Case

  • Is this a machine learning problem?
  • What data is available?
    • Inputs and expected outputs
    • Data size, format, accessibility, metadata/labels
    • Is this sufficient (quality, representation, amount)?
  • What type of ML will we consider?
    • advanced statistics and data science
    • supervised (classification, regression, etc.)
    • unsupervised (clustering, graph structure, etc.)
    • reinforcement learning
  • Is it batch or online learning?
    • Is there a latency requirement on prediction/inference? (see the latency sketch after this list)
  • What metrics will be used to evaluate the algorithm performance?
  • What is our baseline?
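
As the latency sketch referenced in the list above: a minimal way to get a first-order answer is to time single-sample predictions and look at the median and tail. The model and data below are hypothetical stand-ins for the real candidate.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical stand-in data and model; swap in the real candidate.
X = np.random.rand(10_000, 20)
y = np.random.rand(10_000)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Time single-sample inference, which is what an online system would see.
latencies = []
for row in X[:200]:
    start = time.perf_counter()
    model.predict(row.reshape(1, -1))
    latencies.append(time.perf_counter() - start)

print(f"median latency: {np.median(latencies) * 1e3:.2f} ms")
print(f"p95 latency:    {np.percentile(latencies, 95) * 1e3:.2f} ms")
```

If the tail latency already exceeds the requirement on a workstation, that constraint should shape the choice of model and serving design early.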

Data Analysis

What has been done in industry/academia, what does the data say, and what is needed to automate the pipeline?

Industry and Literature Review

  • What is the state-of-the-art solution in the industry – if any?
  • What approaches have been tried in the academic literature and to what conclusion?

Assessment of Training Data

  • Representative enough?
    • Generalize well to new unseen data
    • Accidental correlations, sampling bias, non-response bias, etc.
  • Is quality high enough?
    • Errors, missing values, outliers, and noise
    • Heterogeneity of data distribution
  • Do we have labels?
    • Could we generate labels from a simulation?
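
A minimal sketch of these checks, assuming the training data fits in a pandas DataFrame loaded from a hypothetical training_data.csv with a hypothetical label column:

```python
import pandas as pd

# Hypothetical file name; replace with the real training set.
df = pd.read_csv("training_data.csv")

# Quality: missing values and duplicate rows.
print(df.isna().mean().sort_values(ascending=False).head(10))  # fraction missing per column
print(f"duplicate rows: {df.duplicated().sum()}")

# Representativeness: does the label distribution look plausible for deployment?
if "label" in df.columns:
    print(df["label"].value_counts(normalize=True))

# Outliers and noise: quick summary statistics for numeric columns.
print(df.describe().T[["mean", "std", "min", "max"]])
```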

Exploratory Data Analysis

  • Correlation analysis
  • Sensitivity analysis
  • Conditional independence analysis
  • Spectral analysis
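
As a minimal sketch of the correlation piece (assuming a numeric pandas DataFrame named df and an illustrative |r| > 0.9 threshold), this is the kind of check that would have surfaced the temperature/pressure relationship from the opening anecdote before any model was trained:

```python
import numpy as np
import pandas as pd

# `df` is assumed to be the training DataFrame from the previous step.
corr = df.corr(numeric_only=True)

# Keep only the upper triangle so each feature pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = (
    upper.stack()
    .loc[lambda s: s.abs() > 0.9]
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(strong_pairs)  # near-duplicate sensors and known physical relationships show up here
```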

Feature Engineering

  • Feature selection: remove irrelevant features and keep the most useful among existing features
  • Feature extraction: combining existing features to get more useful features
    • Dimensionality reduction helps discover latent variables
    • Frequency domain
  • Feature generation: find new features by gathering more data
    • Human labels or data from engineering analysis
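
A minimal sketch of the selection and extraction steps with scikit-learn; the variance threshold, k, and component count are illustrative assumptions (k=20 assumes more than 20 raw features), and X_train/y_train are assumed to come out of the data analysis phase:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

feature_pipeline = Pipeline([
    # Feature selection: drop near-constant columns, then keep the most informative ones.
    ("drop_constant", VarianceThreshold(threshold=1e-4)),
    ("select", SelectKBest(score_func=f_classif, k=20)),
    # Feature extraction: scale and project to a lower dimension to expose latent structure.
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
])

# For time-series sensor data, frequency-domain features (e.g. magnitudes from
# np.fft.rfft over a sliding window) can be appended before this pipeline runs.
X_features = feature_pipeline.fit_transform(X_train, y_train)
```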

Model Training and Evaluation

Establish a baseline benchmark, select learning algorithms, and evaluate performance.

Do you have an existing baseline you are trying to meet or exceed using a different approach (e.g. red lines, confidence intervals, prediction error, or other measures)? If not, multiple analytical approaches will be needed in order to establish observed-vs-expected comparative data points.
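
When there is no existing baseline, a trivial predictor gives a floor that any real model must beat. A minimal sketch using scikit-learn's dummy estimators, assuming a regression problem (the classification analogue is DummyClassifier); X and y are the prepared features and target:

```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# X, y are assumed to come out of the feature engineering step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Baseline: always predict the training-set mean.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
print(f"baseline MSE: {baseline_mse:.4f}")  # any candidate model should beat this
```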

Algorithm Selection

  • Supervised Learning
    • Regression
    • Classification
    • Recommender
    • Search and Rank
    • Tag
  • Unsupervised
    • Clustering (k-means, Gaussian mixtures, etc.)
    • Directed graphical models and causality
    • Generative adversarial networks
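
A minimal sketch of screening a few supervised candidates on the same cross-validation folds, assuming a binary classification task; the specific estimators and the ROC-AUC scoring are illustrative choices, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm": SVC(),
}

# X_train, y_train are assumed to come out of the feature engineering step.
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
    print(f"{name:>20}: ROC-AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```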

Evaluation Metrics

  • Classification
    • Precision-Recall
    • ROC-AUC
    • Accuracy
    • Log-Loss
  • Regression
    • MSE
    • R-squared (R²)
  • Unsupervised
    • Mutual Information
    • Log-likelihood
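
A minimal sketch of computing several of these with scikit-learn; y_true/y_prob (binary labels and predicted probabilities) and y_reg_true/y_reg_pred (regression targets and predictions) are placeholder names for your models' outputs:

```python
from sklearn.metrics import (
    accuracy_score,
    log_loss,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
    roc_auc_score,
)

# Classification: y_true holds 0/1 labels, y_prob holds predicted probabilities for class 1.
y_pred = (y_prob >= 0.5).astype(int)  # the 0.5 threshold is an illustrative default
print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_prob))
print("accuracy: ", accuracy_score(y_true, y_pred))
print("log-loss: ", log_loss(y_true, y_prob))

# Regression: continuous targets and predictions.
print("MSE:      ", mean_squared_error(y_reg_true, y_reg_pred))
print("R-squared:", r2_score(y_reg_true, y_reg_pred))
```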

Algorithm Optimization

  • Loss function choices
  • Training error and underfitting
  • Test error and overfitting
  • K-fold cross-validation
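
A minimal sketch of using K-fold cross-validation to expose underfitting and overfitting: keep the training scores so the train/validation gap is visible. The estimator here is an illustrative assumption:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_validate

model = GradientBoostingRegressor(random_state=0)

# 5-fold CV; return_train_score=True makes the train/validation gap visible.
cv_results = cross_validate(
    model,
    X_train,
    y_train,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

train_mse = -cv_results["train_score"].mean()
valid_mse = -cv_results["test_score"].mean()
print(f"train MSE: {train_mse:.4f}")
print(f"valid MSE: {valid_mse:.4f}")
# Both errors high: underfitting. Validation much worse than train: overfitting.
```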

Productionisation Plan

How will we deploy this to production, and how will we maintain and monitor its performance? Who will implement the production system, and what is the relationship between the research scientists and the engineering teams?

  • Training infrastructure and pipeline
  • Inference system and infrastructure
  • Monitoring infrastructure
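
As a minimal sketch of the monitoring piece (the statistical test, threshold, and variable names are assumptions), compare the distribution of live predictions against a snapshot stored at training time and flag drift:

```python
import numpy as np
from scipy.stats import ks_2samp


def check_drift(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when a two-sample KS test rejects 'same distribution' at level alpha."""
    result = ks_2samp(reference, live)
    return result.pvalue < alpha


# training_predictions would be snapshotted at deployment time; recent_predictions
# would come from production inference logs. Both names are hypothetical.
if check_drift(training_predictions, recent_predictions):
    print("Prediction distribution has drifted; trigger review and possible retraining.")
```

The same check can be run per input feature to catch upstream data changes before they degrade the model.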

Continuous Improvement

How will feedback be collected and incorporated to continuously improve the model and keep up with evolving customer needs?
