Fault Detection with Statistical Process Control

This post gives a basic introduction to fault detection with statistical process control (SPC), also referred to as statistical process monitoring (SPM).

What is fault detection?

In many industrial, manufacturing, chemical, and water/wastewater treatment processes, it is important to quickly identify and respond to abnormal changes, outliers, or anomalies (“faults”) that may damage process equipment or harm environmental or human health.

Some examples of faults include:

  • Sensor failure
  • Unexpected changes in process behavior
  • Degradation of sensors or mechanical devices over time

Fault detection is the process by which faults are identified (“detected”). Many methods for fault detection exist, including various supervised and unsupervised machine learning methods. However, these methods can be difficult to fit or interpret, especially when little data is available in advance. SPC methods offer a more traditional statistical alternative that is typically easier to apply, faster, and more interpretable, often without sacrificing much (if any) fault detection performance.

What is SPC?

Statistical process control, or SPC, is a data-driven approach that can be used to detect faults in real time. Rather than relying solely on extensive domain expertise, SPC methods help identify points in time when outliers may have occurred or when the process may be “out-of-control”, meaning that it is no longer operating as it should.

SPC was largely introduced and initially developed by Walter A. Shewhart at Bell Telephone Laboratories in the 1920s to monitor manufacturing quality, and it was later used widely in the production of military munitions during World War II. Today, SPC is applied in a wide variety of settings to determine whether a process is performing as expected. This provides valuable information for process and quality improvement and can help cut costs and reduce interruptions in the process.

How does SPC work?

To understand SPC, it’s important to first define some terms. In most SPC applications, we typically define two phases:

Phase I: What does normal operating behavior look like?

  • Examine historical data and decide whether the process is stable (in-control)
  • Use historical data to mathematically define what normal operating behavior looks like

Phase II: Does the process look different from normal operating behavior?

  • Determine if new observations are in-control or out-of-control

We can often think of Phase I as “building” or “training” our model and Phase II as “deploying” or “testing” the model to make decisions as new data is collected.
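
The two phases can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function names `fit_phase1` and `monitor_phase2`, the `ControlLimits` container, the choice of `k=3.0` standard deviations, and the data values are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class ControlLimits:
    center: float
    lower: float
    upper: float


def fit_phase1(historical, k=3.0):
    """Phase I: estimate a center line and control limits from stable historical data.

    k is the number of standard deviations used for the limits (assumed to be 3 here).
    """
    center = mean(historical)
    spread = stdev(historical)  # sample standard deviation of the Phase I data
    return ControlLimits(center, center - k * spread, center + k * spread)


def monitor_phase2(new_values, limits):
    """Phase II: return True for each new observation that falls within the limits."""
    return [limits.lower <= x <= limits.upper for x in new_values]


# Made-up numbers, for illustration only.
historical = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]
limits = fit_phase1(historical)
in_control = monitor_phase2([10.0, 10.3, 11.5], limits)
print(limits)
print(in_control)
```

Keeping the two steps in separate functions mirrors the split described above: the limits are estimated once from Phase I data and then reused unchanged as Phase II observations arrive.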

Usually, SPC is used to monitor two aspects of a process variable: its mean (average) value and its variability.
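
To see why both aspects matter, consider the rough sketch below. It assumes the measurements arrive in small subgroups of equal size (a hypothetical setup, with made-up values) and computes the two statistics that would be tracked: the subgroup mean and the subgroup standard deviation.

```python
from statistics import mean, stdev

# Hypothetical measurements grouped into subgroups of four readings each.
subgroups = [
    [10.0, 10.2, 9.8, 10.0],
    [10.1, 9.9, 10.0, 10.0],
    [10.0, 11.0, 9.0, 10.0],  # same average, much larger spread
]

subgroup_means = [mean(g) for g in subgroups]  # tracks the process mean
subgroup_sds = [stdev(g) for g in subgroups]   # tracks the process variability

for m, s in zip(subgroup_means, subgroup_sds):
    print(f"mean={m:.2f}  sd={s:.3f}")
```

In this example the third subgroup has the same mean as the others but a much larger standard deviation, so a change of this kind would go unnoticed if only the mean were monitored.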

Example

A simple example of using SPC to monitor the mean value of a single variable of interest is shown in the figure below. It comes from the website cited in the caption, which also has other links and useful basic information about SPC:

SPC example for monitoring the mean of a single variable (https://www.statisticshowto.com/statistical-process-control/).

The figure above plots values at successive points in time, connected by lines. Important components of the plot are the

  • average values of the variable of interest over time (plotted with squares and connected by lines)
  • center line (CL) (middle of the plot)
  • upper and lower control limits (UCL and LCL) (dashed lines)

The center line is often the mean value of Phase I (historical) data and represents what the mean of the variable should be under normal operating conditions.

The control limits are often set a certain number of standard deviations (of the Phase I data) above and below the center line. The control limits represent acceptable limits of the mean value of the variable under normal operating conditions.
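
Written out, one common form of these definitions is the following. The notation is an assumption made for this sketch: x_1, ..., x_n are the Phase I observations, s is their sample standard deviation, and k is the chosen number of standard deviations (k = 3 is a frequent choice).

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}

\mathrm{CL} = \bar{x},
\qquad
\mathrm{UCL} = \bar{x} + k\,s,
\qquad
\mathrm{LCL} = \bar{x} - k\,s
```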

In the upper right of the plot, some points in red exceed the upper control limit. These points indicate observations that differ from what we would expect under normal operating conditions. Whenever a point exceeds a control limit, it is important to examine the process and determine what caused the exceedance. If values continue to exceed the control limit, the process could be negatively affected.
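
Checking for exceedances, and for repeated exceedances in a row, could look roughly like the sketch below. The helper names `find_exceedances` and `longest_run`, the limits, and the data are hypothetical and chosen only for illustration.

```python
def find_exceedances(values, lcl, ucl):
    """Return the indices of observations that fall outside the control limits."""
    return [i for i, x in enumerate(values) if x < lcl or x > ucl]


def longest_run(indices):
    """Return the length of the longest stretch of consecutive indices."""
    best = current = 0
    previous = None
    for i in indices:
        if previous is not None and i == previous + 1:
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = i
    return best


# Made-up observations and limits, for illustration only.
values = [10.0, 10.1, 9.9, 10.6, 10.0, 10.7, 10.8, 10.9]
flagged = find_exceedances(values, lcl=9.6, ucl=10.4)
run_length = longest_run(flagged)
print(flagged, run_length)
```

A single flagged point may be an isolated outlier, while a long run of flagged points suggests a sustained change in the process that deserves closer investigation.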