
Posts

How to deal with missing values in data cleaning

The data you inherit for analysis will often come from multiple sources and will have been pulled together ad hoc, so it will rarely be ready for you to run any kind of model on right away. One of the most common issues you will have to deal with is missing values in the dataset. There are many reasons values might be missing: intentional omission, users not filling in a field, broken online forms, accidental deletion, legacy issues, and so on. Either way, you will need to fix the problem. There are three ways to do this: ignore the missing values, delete the rows that contain them, or fill them in with an approximation. It is easiest to just drop the missing observations, but you need to be very careful before you do that, because the absence of a value might itself be conveying information about the data pattern. If you decide to drop missing values, df_no_missing = df.dropna() will drop any row with a missing value; even if some values in a row are available, the row will still get dropped.
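A minimal sketch of both strategies in pandas (the column names and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small frame with missing values in both columns (hypothetical data)
df = pd.DataFrame({
    "age": [25, np.nan, 30, 22],
    "income": [50000, 60000, np.nan, 45000],
})

# Option 1: drop every row that contains at least one missing value
df_no_missing = df.dropna()

# Option 2: keep all rows and fill gaps with an approximation,
# here the column mean
df_filled = df.fillna(df.mean())
```

Only rows 0 and 3 survive the dropna() call, while fillna() keeps all four rows with the gaps replaced by column means.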
Recent posts

15 Common questions for Machine Learning...!!

1. What is logistic regression?
Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.

2. What is the syntax for logistic regression?
Library: sklearn.linear_model.LogisticRegression
Define model: lr = LogisticRegression()
Fit model: model = lr.fit(x, y)
Predictions: pred = model.predict_proba(test)

3. How do you split the data into train / test sets?
Library: sklearn.model_selection.train_test_split
Syntax: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

4. What is a decision tree?
Given data with attributes together with their classes, a decision tree produces a sequence of rules that can be used to classify the data.

5. What is the syntax for a decision tree classifier?
Library: sklearn.tree.DecisionTreeClassifier
Define model: dtc = DecisionTreeClassifier()
Fit model: model = dtc.fit(x, y)
Predictions: pred
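Putting questions 2, 3, and 5 together, here is a runnable sketch using the iris dataset as a stand-in (the dataset choice and max_iter setting are assumptions, not part of the original answers):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Q3: split the data into train / test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Q2: logistic regression — fit, then predict class probabilities
lr = LogisticRegression(max_iter=1000)
model = lr.fit(X_train, y_train)
pred = model.predict_proba(X_test)

# Q5: decision tree classifier — fit, then predict class labels
dtc = DecisionTreeClassifier(random_state=0)
tree_model = dtc.fit(X_train, y_train)
tree_pred = tree_model.predict(X_test)
```

predict_proba returns one probability column per class (three for iris), while predict returns a single label per test row.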

Differentiate between univariate, bivariate and multivariate analysis.

Univariate analysis comprises descriptive statistical techniques that involve only one variable at a given point of time. For example, a pie chart of sales by territory involves only one variable, so the analysis can be referred to as univariate analysis. Bivariate analysis attempts to understand the relationship between two variables at a time, as in a scatterplot. For example, analyzing sales volume against spending is an example of bivariate analysis. Multivariate analysis deals with the study of more than two variables to understand the effect of the variables on the responses.
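A small sketch of the univariate and bivariate cases with pandas (the sales figures below are invented for illustration):

```python
import pandas as pd

# Hypothetical sales data: volume sold and marketing spend
sales = pd.DataFrame({
    "volume": [10, 15, 12, 20, 18],
    "spend":  [100, 150, 130, 210, 170],
})

# Univariate: summary statistics of a single variable
volume_stats = sales["volume"].describe()

# Bivariate: the relationship between two variables,
# here measured by the correlation coefficient
corr = sales["volume"].corr(sales["spend"])
```

On this toy data the correlation is strongly positive, which is the kind of relationship a scatterplot of volume against spend would make visible.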

Random Forest Algorithm

Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The base estimators in a random forest are decision trees. Random forest randomly selects a set of features that are used to decide the best split at each node of the decision tree. Looking at it step by step, this is what a random forest model does:
1. Random subsets are created from the original dataset (bootstrapping).
2. At each node in the decision tree, only a random set of features is considered to decide the best split.
3. A decision tree model is fitted on each of the subsets.
4. The final prediction is calculated by averaging the predictions from all decision trees.
To sum up, a random forest randomly selects data points and features and builds multiple trees (a forest). Random Forest is also used for feature importance selection; the .feature_importances_ attribute is used to find feature importance. Some important parameters:
1. n_estimators: defines the number of decision trees
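The steps above, plus the feature-importance attribute, can be sketched as follows (the iris dataset and the n_estimators value are assumptions for the example):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators controls how many decision trees the forest builds;
# each tree is fitted on a bootstrapped subset with random feature
# selection at every split
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# One importance score per input feature; the scores sum to 1
importances = rf.feature_importances_
```

For iris this yields four importance scores, one per feature, which can be ranked to select the most informative features.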

Data is the New oil of Industry?

Let's go back to the 18th century, when development was taking its first footsteps and oil was considered the bedrock of the industrial revolution. Oil was then the most valuable asset of its time. Now let's come back to the present. In the 21st century, data is widely called the foundation of the information revolution. But the question that arises is: why are we really calling data the new oil? To explain, let us compare data versus oil. Data is an essential resource that powers the information economy in much the way that oil has fueled the industrial economy. Once upon a time, the wealthiest were those with the most natural resources; now it is a knowledge economy, where how much you know is proportional to how much data you have. Information can be extracted from data just as energy can be extracted from oil. Traditional oil powered the transportation era, and in the same way data, as the new oil, is powering the emerging transportation

Future of Data Science

It is rightly said that data scientists will be shaping the future of businesses in the years to come, and they are already on their path to doing so. Over the years, data has been constantly generated and collected, and the field of data science has put this humongous pile of data to good use. Data can now be collected, processed, analyzed and converted into a highly useful piece of information that gives businesses better and well-informed decision-making capability. "Data is a Precious Thing and will Last Longer than the Systems themselves." Also, Vinod Khosla, an American billionaire businessman and co-founder of Sun Microsystems, declared: "In the next 10 years, Data Science and Software will do more for Medicines than all of the Biological Sciences together." From these two statements, it is clear that data proliferation will never end, and because of that, the use of data-related technologies like Data Science and Big Data will only continue to grow.

Math Skills required for Data Science Aspirants

Knowledge of this essential math is particularly important for newcomers arriving at data science from other professions, especially anyone who wants to transition their career into the data science field. Mathematics is the backbone of data science: you must have the knowledge to deal with data, and behind every algorithm mathematics plays an important role. Here are some of the topics that are important if you don't have a maths background:
1. Statistics and Probability
2. Calculus (Multivariable)
3. Linear Algebra
4. Methods for Optimization
5. Numerical Analysis

1. Statistics and Probability
Statistics and probability are used for visualization of features, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with: mean, median, mode, standard deviation/variance, correlation coefficient and the covariance matrix, probability distributions (Binomial,
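The basic statistics listed above can be computed directly with NumPy; here is a sketch on made-up numbers:

```python
import numpy as np

# Hypothetical sample of eight observations
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(data)      # 5.0
median = np.median(data)  # 4.5
std = np.std(data)        # population standard deviation: 2.0
var = np.var(data)        # variance: 4.0

# Correlation coefficient and covariance matrix for two variables
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
cov_matrix = np.cov(x, y)          # 2x2 covariance matrix
corr = np.corrcoef(x, y)[0, 1]     # 1.0 here, since y = 2x exactly
```

Note that np.std defaults to the population standard deviation (ddof=0); pass ddof=1 for the sample version.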