Skip to main content

How to deal with missing values in data cleaning

The data you inherit for analysis will come from multiple sources and would have been pulled adhoc. So this data will not be immediately ready for you to run any kind of model on. One of the most common issues you will have to deal with is missing values in the dataset. There are many reasons why values might be missing - intentional, user did not fill up, online forms broken, accidentally deleted, legacy issues etc. 

Either way you will need to fix this problem. There are 3 ways to do this - either you will ignore the missing values, delete the missing value rows or fill the missing values with an approximation.

Its easiest to just drop the missing observations but you need to very careful before you do that, because the absence of a value might actually be conveying some information about the data pattern. If you decide to drop missing values :

df_no_missing = df.dropna()

will drop any rows with any value missing. Even if some values are available in a row it will still get dropped even if a single value is missing. 

df_cleaned = df.dropna(how='all')

will only drop rows where all cells are NA or missing values. To drop columns, you will have to add the ‘axis=1’ parameter to the above functions.

The extent of the missing values is identified after identifying the variables with missing values. If any patterns are identified the analyst has to concentrate on them as it could lead to interesting and meaningful business insights. If there are no patterns identified, then the missing values can be substituted with mean or median values (imputation) or they can simply be ignored.There are various factors to be considered when answering this question-

Understand the problem statement, understand the data and then give the answer.Assigning a default value which can be mean, minimum or maximum value. Getting into the data is important.

If it is a categorical variable, the default value is assigned. The missing value is assigned a default value.

If you have a distribution of data coming, for normal distribution give the mean value.

Should we even treat missing values is another important point to consider? If 80% of the values for a variable are missing then you can answer that you would be dropping the variable instead of treating the missing values.

Comments

  1. Thank you for sharing such a useful article. It will be useful to those who are looking for knowledge. Continue to share your knowledge with others through posts like these, and keep posting on
    Data Engineering Services 

    ReplyDelete

Post a Comment

Popular posts from this blog

Data Science Methodology- A complete Overview

The people who work in Data Science and are busy finding the answers for different questions every day comes across the Data Science Methodology. Data Science Methodology indicates the routine for finding solutions to a specific problem. This is a cyclic process that undergoes a critic behaviour guiding business analysts and data scientists to act accordingly. Business Understanding: Before solving any problem in the Business domain it needs to be understood properly. Business understanding forms a concrete base, which further leads to easy resolution of queries. We should have the clarity of what is the exact problem we are going to solve. Analytic Understanding: Based on the above business understanding one should decide the analytical approach to follow. The approaches can be of 4 types: Descriptive approach (current status and information provided), Diagnostic approach(a.k.a statistical analysis, what is happening and why it is happening), Predictive approach(it forecasts on...

Data is the New oil of Industry?

Let's go back to 18th century ,when development was taking its first footstep.The time when oil was considered to be the subset of industrial revolution. Oil than tends to be the most valuable asset in those time. Now let's come back in present. In 21st century, data is vigorously called the foundation of information revolution. But the question that arises is why are we really calling data as the new oil. Well for it's explanation Now we are going to compare Data Vs Oil Data is an essential resource that powers the information economy in much the way that oil has fueled the industrial economy. Once upon a time, the wealthiest were those with most natural resources, now it’s knowledge economy, where the more you know is proportional to more data that you have. Information can be extracted from data just as energy can be extracted from oil. Traditional Oil powered the transportation era, in the same way that Data as the new oil is also powering the emerging transportation op...

Future of Data Science

It is rightly said that Data Scientists would be shaping the future of the businesses in the years to come. And trust me they are already on their path to do so. Over the years, data is constantly being generated and collected as well. Now, the field of data sciences has put this humongous pile of data to good use. Now, data can be collected, processed, analyzed and converted into a highly useful piece of information that would benefit the businesses with better and well-informed decision-making capability. "Data is a Precious Thing and will Last Longer than the Systems themselves." Also, Vinod Khosla, an American Billionaire Businessman and Co-founder of Sun Microsystems declared – "In the next 10 years, Data Science and Software will do more for Medicines than all of the Biological Sciences together." By the above two statements, it is clear that data proliferation will never end and because of that, the use of data related technologies like Data Science and Big D...