
Data Science Methodology - A Complete Overview

People who work in data science, busy finding answers to different questions every day, rely on the Data Science Methodology. The Data Science Methodology describes a routine for finding solutions to a specific problem. It is a cyclic process that guides business analysts and data scientists to act appropriately at each stage.
  1. Business Understanding:
    Before solving any problem in the business domain, the problem needs to be understood properly. Business understanding forms a concrete base, which makes later queries easier to resolve. We should be clear about the exact problem we are going to solve.
  2. Analytic Understanding:
    Based on the business understanding above, one should decide which analytical approach to follow. The approaches are of four types: the descriptive approach (reports the current status from the information provided), the diagnostic approach (also called statistical analysis; what is happening and why it is happening), the predictive approach (forecasts trends or the probability of future events) and the prescriptive approach (how the problem should actually be solved).
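To make the distinction concrete, here is a minimal sketch contrasting the descriptive and predictive approaches; the sales figures and variable names are made-up assumptions for illustration:

```python
import statistics

# Hypothetical monthly sales figures
sales = [100, 110, 125, 130, 150, 160]

# Descriptive approach: summarize the current status
mean_sales = statistics.mean(sales)   # average level so far
growth = sales[-1] - sales[0]         # overall change

# Predictive approach: forecast the next value with a naive
# linear trend (average month-over-month change)
avg_step = growth / (len(sales) - 1)
forecast = sales[-1] + avg_step       # 172.0
```

The descriptive numbers only report what has happened; the forecast, however naive, is a statement about what may happen next.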
  3. Data Requirements:
    The analytical method chosen above indicates the necessary data content, formats and sources to be gathered. While working out the data requirements, one should find answers to questions like ‘what’, ‘where’, ‘when’, ‘why’, ‘how’ & ‘who’.
  4. Data Collection:
    Collected data can arrive in any random format. So, according to the chosen approach and the output to be obtained, the collected data should be validated. If required, one can then gather more data or discard the irrelevant data.
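A simple validation pass might look like the sketch below; the record layout and field names are hypothetical:

```python
# Hypothetical raw records collected from mixed sources
records = [
    {"customer_id": 1, "spend": "250.0"},
    {"customer_id": 2, "spend": "abc"},   # malformed value
    {"customer_id": 3},                   # missing field
    {"customer_id": 4, "spend": "99.5"},
]

def is_valid(rec):
    """Keep only records whose 'spend' field parses as a number."""
    try:
        float(rec["spend"])
        return True
    except (KeyError, ValueError):
        return False

valid = [r for r in records if is_valid(r)]   # 2 records survive
```

Records that fail validation are either discarded, as here, or flagged so that more data can be collected for them.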
  5. Data Understanding:
    Data understanding answers the question “Is the collected data representative of the problem to be solved?”. Descriptive statistics are calculated over the data to assess its content and quality. This step may lead back to the previous step for correction.
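As a sketch of this quality check, the example below summarizes a made-up set of readings; a high missing ratio or an implausible range would send us back to data collection:

```python
import statistics

# Hypothetical sensor readings; None marks a missing value
readings = [20.1, 20.3, None, 19.8, 21.0, None, 20.5]

present = [r for r in readings if r is not None]
missing_ratio = (len(readings) - len(present)) / len(readings)

summary = {
    "count": len(present),
    "mean": statistics.mean(present),
    "stdev": statistics.stdev(present),
    "min": min(present),
    "max": max(present),
}
```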
  6. Data Preparation:
    Let’s understand this step through two analogies: washing freshly picked vegetables, and taking only the items you want onto your plate at a buffet. Washing the vegetables corresponds to removing dirt, i.e. unwanted material, from the data; this is where noise removal is done. Taking only edible items onto the plate means that if we don’t need specific data, we should not carry it into further processing. The whole process includes transformation, normalization, etc.
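Both analogies can be sketched in a few lines; the values and the outlier threshold below are illustrative assumptions:

```python
# "Washing the vegetables": drop missing values and gross
# outliers, then min-max normalize what remains.
raw = [12.0, None, 15.0, 300.0, 14.0, 13.0]   # 300.0 is an outlier

# Noise removal: discard missing values and outliers
clean = [x for x in raw if x is not None and x < 100]

# Normalization: rescale the clean values to the [0, 1] range
lo, hi = min(clean), max(clean)
normalized = [(x - lo) / (hi - lo) for x in clean]
```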
  7. Modelling:
    Modelling decides whether the data prepared for processing is appropriate or requires more finishing and seasoning. This phase focuses on building predictive or descriptive models.
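A minimal predictive model built on prepared data might be a simple linear regression fitted by ordinary least squares; the training pairs below are a made-up example:

```python
# Prepared training data (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form OLS solution for slope and intercept
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x):
    return slope * x + intercept
```

If the fitted model performs poorly, that is often a signal that the data needs more preparation, which is exactly the "finishing and seasoning" decision this phase makes.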
  8. Evaluation:
    Model evaluation is done during model development. It checks the quality of the model and whether it meets the business requirements. It involves a diagnostic-measures phase (does the model work as intended, and where are modifications required?) and a statistical significance testing phase (which ensures the data is handled and interpreted properly).
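As a sketch of a diagnostic measure, the example below scores a classifier's predictions against held-out labels and compares them to a majority-class baseline; the labels are illustrative:

```python
# Held-out labels vs. a hypothetical model's predictions
actual    = [1, 0, 1, 1, 0, 1, 0, 1]
predicted = [1, 0, 1, 0, 0, 1, 1, 1]

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)              # 0.75

# Majority-class baseline: would a trivial model do as well?
majority = max(set(actual), key=actual.count)
baseline = actual.count(majority) / len(actual)   # 0.625
```

A model that cannot beat the trivial baseline fails the business requirement regardless of how it was built.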
  9. Deployment:
    Once the model has been evaluated effectively, it is made ready for deployment in the business market. The deployment phase checks how well the model withstands the external environment and whether it performs better than the alternatives.
  10. Feedback:
    Feedback serves the necessary purpose of refining the model and assessing its performance and impact. The steps involved in feedback are defining the review process, tracking the record, measuring effectiveness, and reviewing with refinement.
After the successful completion of these 10 steps, the model should not be left untouched; rather, appropriate updates should be made based on the feedback and deployment results. As new technologies emerge, new trends should be reviewed so that the model continues to provide value in its solutions.
