8 Steps and 60 Activities to Complete a Data Science Project

Mohammad Nuruzzaman
4 min read · Mar 15, 2021


It’s hard to know where to start when diving into the fascinating world of data and AI. The first time you work on a data science project, there is no clear vision and no obvious pathway through the steps it takes to carry out a complete analysis and finish the project.

At ELMO, we follow a systematic, structured approach that helps data scientists maximize their chances of success in a data science project at the lowest cost. To succeed, every data scientist needs to go through the eight major steps (Fig. 1), from raw data preparation to building an ML model and ultimately to ML operations (MLOps).

Fig. 1 — Data Science Life Cycle — Predictive Data Analytics Project at ELMO Software

Step 1: Business Understanding

The first thing you have to do before solving a problem is to define exactly what it is. Understanding the business requirements and strategy is the key to ensuring success, and it is the first step of any data science project. At this stage, you should be clear about the objectives and goals of the project and able to translate the business questions into something actionable.

Step 2: Analytics Approach

Once the business problem and goal are clearly stated, the data scientist can define the analytical approach to solving it. This includes choosing among statistical approaches and machine learning techniques, and determining how likely a predictive model is to achieve the desired result.

Step 3: Data Understanding

In this stage, you need to identify the various sources (e.g. Amazon S3, Delta Lake, databases, CSV files) from which to collect all the data (structured or semi-structured) needed to solve the predictive analytics problem. You may use Apache Airflow or Airbyte to collect and transfer data to an analytics platform such as Databricks. The data scientist then visualizes the data to understand it, assess its quality, and obtain initial information about it. If any gaps are found, you may need to review the data requirements and collect more data.
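A minimal sketch of this first assessment pass, using pandas. The file contents and column names below are hypothetical stand-ins for data pulled from S3 or a database:

```python
import io

import pandas as pd

# Hypothetical employee data standing in for a CSV collected from a source system.
raw_csv = io.StringIO(
    "employee_id,department,tenure_years,performance,left_company\n"
    "1,Sales,3.5,High,1\n"
    "2,Engineering,1.2,Medium,0\n"
    "3,Sales,,High,1\n"
)
df = pd.read_csv(raw_csv)

# Shape, dtypes, and missing-value counts tell you whether the data
# requirements are actually met, or whether you need to collect more.
print(df.shape)          # (3, 5)
print(df.dtypes)
print(df.isna().sum())   # tenure_years has one missing value
```

Checks like these are exactly where gaps surface; if a key field is mostly empty, it is cheaper to find out now than after modelling.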

Step 4: Data Preparation

Data can arrive in any format. Before it can be analyzed, data has to be cleaned and prepared into a consistent format: handling missing values, removing duplicates, and dropping unnecessary and inconsistent records. Identifying outliers and anomalies, deriving variables, aggregating data, and enriching it into quality data are among the many other activities performed during data preparation. TensorFlow Data Validation (TFDV) can be a good tool here. This stage is considered one of the most time-consuming steps in a data science project. It is also a crucial step for preventing wrong predictions.

Step 5: Exploratory Data Analysis

Now you have a clean dataset, and it’s time to explore it and build graphs. Exploratory data analysis (EDA) plays a very important role in data analytics. The difficulty here is not coming up with ideas; it is coming up with ideas that are likely to turn into useful insights: finding interesting patterns and trends, and studying the behaviour of the data in ways that help explain the business problem, such as why high-performing employees are leaving the company. You can use any data visualization tool, such as Superset, Mixpanel, Tableau, Power BI, or Databricks.
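Taking the employee-attrition question as an example, a first EDA cut is the kind of aggregate you would then chart in Superset or a Databricks notebook. The data here is a hypothetical toy sample:

```python
import pandas as pd

# Hypothetical HR data: do high performers leave more often?
df = pd.DataFrame({
    "performance": ["High", "High", "High", "Medium", "Medium", "Low"],
    "left_company": [1, 1, 0, 0, 1, 0],
})

# Attrition rate per performance band, highest first.
attrition = (
    df.groupby("performance")["left_company"]
      .mean()
      .sort_values(ascending=False)
)
print(attrition)
```

In this toy sample the high-performance band has the highest attrition rate, which is precisely the sort of pattern that turns into a business insight worth drilling into.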

Step 6: Modelling

Modelling, where the real fun starts, focuses on developing either predictive or descriptive models. Machine learning (ML) algorithms can help you go a step further in extracting insights: by analyzing past data, they can predict future trends. For predictive modelling, data scientists typically use 70–80% of the data as a training set and fit it to, for example, a Keras/TensorFlow model. Before fitting the data to the model, you need to perform imputation, check the sample distribution, engineer features, handle outliers, and apply normalization and feature scaling.
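The split-scale-fit sequence looks the same regardless of framework; the sketch below uses scikit-learn with synthetic data as a lightweight stand-in for a Keras/TensorFlow model on your prepared features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in practice this is your prepared feature matrix.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 20% for testing, matching the 80/20 split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling before fitting: fit the scaler on training data only,
# then apply the same transform to the test set to avoid leakage.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

print(model.score(scaler.transform(X_test), y_test))
```

Note the leakage guard: the scaler never sees the test set during fitting, which keeps the held-out evaluation honest.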

Step 7: Evaluation

Finally, data scientists evaluate the model, using 20% of the data as a test set and 10% as a validation set. The model should be validated with a classification report, F1 score, Kappa score, confusion matrix, and AUC/ROC curves. At this stage, you must try to improve the efficiency of the predictive model so that it makes more accurate predictions; if there are any issues, the model needs to be optimized.
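All of the metrics listed above are available in scikit-learn. The labels and scores below are hypothetical test-set outputs, just to show the calls:

```python
from sklearn.metrics import (classification_report, cohen_kappa_score,
                             confusion_matrix, f1_score, roc_auc_score)

# Hypothetical test-set labels, hard predictions, and predicted probabilities.
y_true  = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.2, 0.6, 0.9, 0.8, 0.4, 0.7, 0.3]

print(confusion_matrix(y_true, y_pred))           # [[3 1] [1 3]]
print(classification_report(y_true, y_pred))
print("F1:   ", f1_score(y_true, y_pred))         # 0.75
print("Kappa:", cohen_kappa_score(y_true, y_pred))  # 0.5
print("AUC:  ", roc_auc_score(y_true, y_score))   # 0.9375
```

Note that AUC is computed from the probability scores, not the hard predictions; comparing it against F1 and Kappa often reveals whether the classification threshold, rather than the model itself, needs tuning.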

Step 8: Deployment

The predictive model must not sit on the shelf; it needs to be operationalized. The end goal is to deploy the ML model into the production environment for final user acceptance. Deploying the ML model is vital for any organization, and for you, to realize the full benefits of your data science efforts. MLflow, TFX, or Kubeflow can be used to deploy ML models. The client must validate the performance of the model, and the model needs to be optimized if there are any issues.
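With MLflow, for example, one common packaging route is an MLproject file that makes the training run reproducible before serving. The project name, entry point, and parameters below are illustrative, not a prescription:

```yaml
# MLproject — illustrative MLflow packaging file (names are hypothetical)
name: attrition_model

entry_points:
  main:
    parameters:
      data_path: {type: string, default: "data/prepared.csv"}
    command: "python train.py --data-path {data_path}"
```

Packaging the run this way means the client-validation loop described above can be repeated on demand: re-run, re-log, re-deploy.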

Fig. 2 shows the list of activities involved in each step: https://miro.com/app/board/o9J_lPCHYtQ=/

Once you have mastered all the steps of the data science lifecycle, why not automate the process and save even more time? Thank you for reading. This series will continue, so please stay tuned for more…

Written by Mohammad Nuruzzaman

Data Scientist at Ausloans Finance Group … Delivering high-impact AI solutions through MLOps and predictive analytics.
