Only a very small percentage of customers will churn. Manufacturing machines will malfunction perhaps twice a year. Only 0.1% of transactions will be fraudulent. Yet it is too costly to ignore these rare incidents. The churn rate may rapidly increase; a single out-of-spec lot may cost thousands of dollars; any undetected fraudulent attempt will cost not only money but also the trust of customers. Being able to predict these “rare but costly” incidents could therefore bring tremendous benefits and improve the overall quality and performance of an organisation.
Machine Learning Use Cases in Industry
These cases are quite common in the world of machine learning, especially in industrial applications. They share one property, which we call the “class imbalance problem”: the total number of anomalous data points is far smaller than the total number of normal data points. At Evolusys, we observe that the class imbalance problem can arise in many industrial use cases including, but not limited to:
- Factory & Demand-Side analytics and optimization
- Improving preventative Maintenance, Repair and Overhaul (MRO)
- Predictive maintenance
- Factor analysis - Quality control
- Supply-chain optimization
Figure 1. Applying machine learning in industry, and types of data ingested in these applications
In this post, we emphasize the “Factor analysis – Quality control” branch, where we demonstrate our expertise in data engineering and machine learning: we overcome the class imbalance problem and deliver insights on the rare but costly incidents.
Case Study: An example from the Pharma Industry
Pharma Corporation ABC has manufacturing machines which produce medicines. They use Indicator-XYZ as the main quality indicator for these medicines. For previously unknown reasons, however, the Indicator-XYZ value can be out of specification for certain lots, with no advance warning from previous manufacturing steps. Although only 0.1% of the lots are out of specification, it is too costly for ABC to ignore them.
In this project, Evolusys set out to answer the following core question: can an identified difference in the behavior of the manufacturing machines be explained by process parameters?
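To see why a 0.1% anomaly rate is hard to work with, the following minimal sketch evaluates a trivial "model" that labels every lot as in specification. The data is hypothetical (10,000 lots, 10 of them out of specification) and the `evaluate` helper is our own illustration, not part of the project code. It only shows that plain accuracy looks excellent while the rare lots are missed entirely.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when nothing is predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical data: 1 = out-of-specification lot, 0 = normal lot (0.1% positives).
y_true = [1] * 10 + [0] * 9990

# A trivial baseline that never raises an alarm.
always_normal = [0] * len(y_true)

accuracy, precision, recall = evaluate(y_true, always_normal)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# prints: accuracy=0.999 precision=0.000 recall=0.000
```

For this reason, metrics computed on the rare class, such as precision and recall, are usually more informative than accuracy when classes are this imbalanced.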
The main objectives were to:
- Perform exploratory analysis on the data provided by Pharma Corporation ABC;
- Construct hypotheses from the exploration results to explain the characteristics of out-of-specification lots;
- Build machine learning models to validate the hypotheses.
We focused on:
- Setting up an Azure HDInsight cluster and an Azure Data Lake storage;
- Setting up a suite of tools, notably Spark, to perform analysis on the data;
- Performing data transformation to speed up the data analysis process;
- Implementing time series analysis and machine learning methodologies to conduct the analysis.
In the last step, we paid particular attention to:
- Choosing a machine learning model that is robust against class imbalance;
- Partitioning the data deliberately, so that the proportion of anomalous and normal data points is preserved in both the training and test sets.
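The two points above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the project's actual implementation: `stratified_split` and `balanced_class_weights` are hypothetical helper names, the labels are synthetic, and inverse-frequency class weighting is only one common way to make a model less sensitive to imbalance (it follows the same formula as the "balanced" class-weight heuristic in scikit-learn).

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Send at least one example of every class to the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}


# Synthetic labels: 1 = out-of-specification lot, 0 = normal lot (0.1% positives).
labels = [1] * 10 + [0] * 9990

train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
train_labels = [labels[i] for i in train_idx]
test_labels = [labels[i] for i in test_idx]

print(sum(train_labels), len(train_labels))  # 8 anomalous lots out of 8000
print(sum(test_labels), len(test_labels))    # 2 anomalous lots out of 2000
print(balanced_class_weights(train_labels)[1])  # 500.0
```

Both partitions keep the 0.1% anomaly rate, so the test set still contains anomalous lots to evaluate against. The weights can then be passed to any learner that accepts per-class or per-sample weights, so that a misclassified anomalous lot counts far more than a misclassified normal one.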
Figure 2. Our proposed architecture
- We transformed and prepared the data for distributed processing.
- We formulated hypotheses in a data-driven manner.
- We built a supervised learning model to validate our hypotheses.
Using this model, we identified a subset of manufacturing components whose malfunctions influence the quality of the lots. We also identified the relative importance of these features in predicting the quality of the lots.
Future directions include the deployment of a real-time predictive analytics system based on our findings.
The class imbalance problem is very common in the world of machine learning. In this post we covered one example case where we addressed this problem. Our underlying methodology can be adapted to other use cases such as churn analysis, anomaly detection, and fraud detection.
At Evolusys, we are dedicated to tailoring the best machine learning solutions for our clients.