My Data Science Research Canvas
Don’t Repeat Myself when starting a data science project.
I am writing this post to remind myself about the list of items that I need to step through when starting a data science project.
Define the problem
You can’t solve a problem until you have defined it.
Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise. — John Tukey
Knowing what is possible will also save you time down the track. It is practically impossible to predict churn when customers all cancel their subscriptions as soon as they finish signing up.
These are the steps I take to formulate and refine the problem before I dive into building a model. I will use churn as an example and share some of the resources I found useful during my research.
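A quick way to check this kind of feasibility before any modelling is to look at how long customers stay before they cancel. The sketch below is only an illustration: the `subscriptions` records, the helper names and the 1-day threshold are all made up for the example.

```python
from datetime import date
from statistics import median

# Hypothetical records: (signup_date, cancel_date or None if still active).
subscriptions = [
    (date(2023, 1, 1), date(2023, 1, 1)),
    (date(2023, 1, 2), date(2023, 1, 3)),
    (date(2023, 1, 5), date(2023, 3, 5)),
    (date(2023, 1, 7), None),
]

def days_to_cancel(records):
    """Return the tenure in days of every customer who has cancelled."""
    return [(cancel - signup).days for signup, cancel in records if cancel is not None]

def share_cancelled_within(records, max_days):
    """Fraction of cancellations that happen within max_days of signup."""
    tenures = days_to_cancel(records)
    if not tenures:
        return 0.0
    return sum(t <= max_days for t in tenures) / len(tenures)

tenures = days_to_cancel(subscriptions)
early_share = share_cancelled_within(subscriptions, max_days=1)
print(f"median days to cancel: {median(tenures)}")
print(f"share cancelling within 1 day: {early_share:.0%}")
```

If most cancellations land within a day of signup, there is very little behaviour to learn from, and the problem probably needs to be reframed before a model is worth building.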
Identify existing business applications, practices and resources
I often start by identifying existing business applications and practices; the first step does not need to be an ML model.
This phase often helps me to identify:
- How people think about churn. Churn is not a single problem: it can result from natural usage, bad product-market fit, poor experience, etc.
- What is the industry standard for churn? Are we doing well?
Here is some of the research I found while looking into churn.
What is good retention — Issue 29
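To compare against an industry benchmark, I first need my own retention number computed in a comparable way. Here is a minimal sketch of an N-day retention rate, assuming a hypothetical `customers` mapping of signup and cancellation dates; the names and the 30-day window are illustrative only.

```python
from datetime import date, timedelta

# Hypothetical data: customer id -> (signup_date, cancel_date or None).
customers = {
    "a": (date(2023, 1, 3), None),
    "b": (date(2023, 1, 10), date(2023, 1, 20)),
    "c": (date(2023, 1, 15), date(2023, 4, 1)),
    "d": (date(2023, 1, 28), None),
}

def retention_rate(customers, days, as_of):
    """Share of customers still subscribed `days` after signup.

    Customers who signed up fewer than `days` before `as_of` are skipped,
    because their outcome cannot be observed yet.
    """
    horizon = timedelta(days=days)
    eligible = retained = 0
    for signup, cancel in customers.values():
        if signup + horizon > as_of:
            continue  # not enough history to judge this customer
        eligible += 1
        if cancel is None or cancel > signup + horizon:
            retained += 1
    return retained / eligible if eligible else None

rate_30d = retention_rate(customers, days=30, as_of=date(2023, 6, 30))
print(rate_30d)
```

Skipping customers without a full observation window matters: counting recent signups as retained would make the number look better than it is.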
Identify ML practices in the area
The next research step focuses on ML practice: how are other data scientists tackling the problem? It helps me identify:
- How people formulate the problem. Churn prediction, for example, can be formulated as binary classification or as detecting anomalous behaviour in product usage prior to churn.
- A starting point and a benchmark metric.
- Common problems faced by others. (Data leakage is one of the most common difficulties in curating a dataset for churn prediction.)
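To make the binary classification framing and the leakage problem concrete, here is a rough sketch of how a training set could be built around a cutoff date. It is one possible approach, not a standard recipe: the `events` and `cancellations` inputs, the feature names and the 30-day horizon are all assumptions for the example.

```python
from datetime import date, timedelta

# Hypothetical inputs: usage events and cancellation dates per customer.
events = [
    ("a", date(2023, 3, 1)), ("a", date(2023, 3, 20)), ("a", date(2023, 4, 5)),
    ("b", date(2023, 3, 10)), ("b", date(2023, 3, 28)),
    ("c", date(2023, 2, 1)),
]
cancellations = {"a": date(2023, 4, 10), "c": date(2023, 3, 15)}

def build_examples(events, cancellations, cutoff, horizon_days=30):
    """Build one row per customer without looking past the cutoff.

    Features use only events strictly before `cutoff`. The label is 1 if the
    customer cancels in [cutoff, cutoff + horizon_days), otherwise 0.
    """
    window_end = cutoff + timedelta(days=horizon_days)
    examples = {}
    for customer, day in events:
        cancelled_on = cancellations.get(customer)
        if cancelled_on is not None and cancelled_on < cutoff:
            continue  # already churned before the cutoff: nothing to predict
        if day >= cutoff:
            continue  # an event after the cutoff would leak the future
        row = examples.setdefault(
            customer, {"n_events": 0, "days_since_last": None, "label": 0}
        )
        row["n_events"] += 1
        gap = (cutoff - day).days
        if row["days_since_last"] is None or gap < row["days_since_last"]:
            row["days_since_last"] = gap
        row["label"] = int(cancelled_on is not None and cancelled_on < window_end)
    return examples

examples = build_examples(events, cancellations, cutoff=date(2023, 4, 1))
for customer, row in examples.items():
    print(customer, row)
```

The point of the cutoff is that features and labels never share a time window. Customer "a" has an event on 5 April that is deliberately ignored, because at prediction time on 1 April it would not exist yet.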
Customer Churn Prediction Using Machine Learning: Main Approaches and Models
Games and Big Data: A Scalable Multi-Dimensional Churn Prediction Model
Why you should stop predicting customer churn and start using uplift models
Public datasets and analyses
I also find it helpful to look at public datasets and analyses on a platform like Kaggle. They give ideas about how the data can be curated and the issues that come with it.
Telecom Customer Churn Prediction
Bank Customer Churn Prediction
Perform an Exploratory Data Analysis
This ties together the research from the steps above and shows how it fits the specific problem you are dealing with.
Aftermath
At the end of this particular project, I did end up building a churn model as requested, but what was most valuable was the insight into the different pathways by which a customer can churn.
Giving stakeholders a binary label or a churn probability for each customer is of very limited value: it does not tell them why customers churn, which leaves them with generic approaches such as price discounts.
On the other hand, segmenting customers by their usage and then identifying the points of intervention gives marketers and stakeholders more power to design solutions for each individual scenario.
The retention strategy for a streaming platform user who returns every day at 8 pm will be very different from a customer who binges 10 series in 3 days and then goes dormant.
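As a rough illustration of what usage-based segments could look like, the sketch below separates those two viewers with a couple of hand-written rules. Everything in it is hypothetical: the `viewing` log, the feature names and the thresholds are placeholders, and in practice the segments would come out of the exploratory analysis rather than fixed cut-offs.

```python
from datetime import date

# Hypothetical viewing log: customer id -> list of (day, episodes watched).
viewing = {
    "habitual": [(date(2023, 5, d), 1) for d in range(1, 29)],
    "binger": [(date(2023, 5, 1), 30), (date(2023, 5, 2), 35), (date(2023, 5, 3), 25)],
}

def usage_profile(sessions, as_of):
    """Summarise a non-empty list of (day, episodes) sessions."""
    days = sorted({day for day, _ in sessions})
    episodes = sum(count for _, count in sessions)
    return {
        "active_days": len(days),
        "episodes_per_active_day": episodes / len(days),
        "days_since_last": (as_of - days[-1]).days,
    }

def segment(profile, heavy_per_day=10, dormant_after=14):
    """Assign a coarse segment using illustrative thresholds."""
    dormant = profile["days_since_last"] >= dormant_after
    if dormant and profile["episodes_per_active_day"] >= heavy_per_day:
        return "binge_then_dormant"
    if dormant:
        return "lapsed"
    return "regular"

as_of = date(2023, 5, 31)
segments = {cid: segment(usage_profile(s, as_of)) for cid, s in viewing.items()}
print(segments)
```

Each segment then points to a different intervention, for example a new-release reminder for the dormant binger rather than a blanket discount for everyone.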