
How to deal with missing values in data cleaning

The data you inherit for analysis will usually come from multiple sources and will often have been pulled ad hoc, so it will not be immediately ready for you to run any kind of model on. One of the most common issues you will have to deal with is missing values in the dataset. There are many reasons why values might be missing: the field was left blank intentionally, the user did not fill it in, an online form was broken, the data was accidentally deleted, legacy issues, and so on.

Either way, you will need to deal with this problem. There are three ways to do this: ignore the missing values, delete the rows with missing values, or fill the missing values with an approximation.

It's easiest to just drop the missing observations, but you need to be very careful before you do that, because the absence of a value might actually convey some information about the data pattern. If you decide to drop missing values:

df_no_missing = df.dropna()

will drop every row that has at least one missing value. A row is dropped even if only a single value is missing and all the others are available.

df_cleaned = df.dropna(how='all')

will only drop rows where all cells are NA or missing. To drop columns instead of rows, add the axis=1 parameter to the calls above.
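As a rough illustration of how these options differ, here is a minimal sketch on a small made-up DataFrame. The column names and values are assumptions chosen only to show the behaviour of dropna.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: every row has at least one missing value,
# and the "notes" column is missing everywhere.
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan],
    "city": ["Pune", "Delhi", None, None],
    "notes": [np.nan, np.nan, np.nan, np.nan],
})

# Drop rows with ANY missing value (the default, how="any").
rows_any = df.dropna()

# Drop only rows where ALL values are missing.
rows_all = df.dropna(how="all")

# Same idea along columns: axis=1 drops columns instead of rows.
cols_all = df.dropna(axis=1, how="all")
cols_any = df.dropna(axis=1)

print(len(rows_any), len(rows_all))
print(list(cols_all.columns), list(cols_any.columns))
```

On this data the default call removes every row, which is a good reminder to check how much data is left after dropping.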

Once you have identified which variables have missing values, work out how extensive the missing values are. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation), or they can simply be ignored. There are various factors to consider when deciding how to treat missing values:
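One way to measure the extent of missing values, and to take a first look for patterns, is sketched below. The DataFrame and its columns (income, segment, age) are made up for illustration, and comparing average age between rows with and without income is just one possible pattern check.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of three columns.
df = pd.DataFrame({
    "income": [50000, np.nan, 62000, np.nan, 58000],
    "segment": ["A", "B", None, "B", "A"],
    "age": [31, 45, 28, 52, 39],
})

# Count and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# Keep only the columns that actually have missing values.
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
summary = summary[summary["missing"] > 0].sort_values("share", ascending=False)
print(summary)

# Simple pattern check: does average age differ between rows
# where income is missing and rows where it is present?
age_by_income_missing = df.groupby(df["income"].isna())["age"].mean()
print(age_by_income_missing)
```

If the two groups look very different, the values are probably not missing at random, and it is worth investigating before dropping or filling anything.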

Understand the problem statement and the data before deciding on a treatment; getting into the data is important. One option is to assign a default value, which can be the mean, minimum or maximum value.

If it is a categorical variable, assign the missing values a default value.

If you know the distribution of the data, use it to choose the fill value: for a normal distribution, use the mean.
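The fill strategies above could look something like this in pandas. This is a sketch under assumptions: the data is made up, the median is used for the skewed income column as a common alternative to the mean, and the most frequent category is taken as the "default value" for the categorical column, which is only one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one roughly symmetric numeric column,
# one skewed numeric column and one categorical column.
df = pd.DataFrame({
    "height": [160.0, np.nan, 170.0, 180.0],
    "income": [30000.0, 32000.0, np.nan, 250000.0],
    "city": ["Pune", None, "Pune", "Delhi"],
})

imputed = df.copy()

# Roughly normal data: fill with the mean.
imputed["height"] = imputed["height"].fillna(imputed["height"].mean())

# Skewed data: the median is less affected by the outlier.
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Categorical data: fill with a default, here the most frequent category.
default_city = imputed["city"].mode()[0]
imputed["city"] = imputed["city"].fillna(default_city)

print(imputed)
```

A fixed label such as "Unknown" is another reasonable default for categorical columns, because it keeps the fact that the value was missing visible in the data.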

Whether to treat missing values at all is another important point to consider. If 80% of the values for a variable are missing, it is reasonable to drop the variable instead of treating the missing values.
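Dropping mostly empty variables can be sketched as follows. The 80% cutoff comes from the text above, while the DataFrame, the column names and the choice of "greater than or equal to" are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "fax_number" is missing in 5 of 6 rows.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],
    "fax_number": [np.nan, np.nan, np.nan, np.nan, np.nan, "555-0100"],
    "income": [50000, np.nan, 62000, 58000, 61000, 47000],
})

threshold = 0.8  # drop variables with at least 80% of values missing

# Share of missing values per column.
missing_share = df.isna().mean()

# Columns at or above the threshold are dropped rather than imputed.
cols_to_drop = missing_share[missing_share >= threshold].index.tolist()
reduced = df.drop(columns=cols_to_drop)

print(cols_to_drop)
print(list(reduced.columns))
```

The income column has only one gap, so it stays and can be handled with imputation instead.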

