
15 Common questions for Machine Learning...!!

1. What is logistic regression?

Logistic regression is a machine learning algorithm for classification. In this algorithm, the probabilities describing the possible outcomes of a single trial are modelled using a logistic function.

2. What is the syntax for logistic regression?

Library: sklearn.linear_model.LogisticRegression

Define model: lr = LogisticRegression()

Fit model: model = lr.fit(x, y)

Predictions: pred = model.predict_proba(test)
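
Putting those lines together, here is a minimal end-to-end sketch; the iris dataset and the max_iter setting are illustrative choices, not part of the original syntax.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

lr = LogisticRegression(max_iter=1000)  # define model
model = lr.fit(X, y)                    # fit model on features X, labels y
pred = model.predict_proba(X[:5])       # class probabilities for 5 rows
```

Each row of pred holds one probability per class, summing to 1.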

3. How do you split the data in train / test?

Library: sklearn.model_selection.train_test_split

Syntax: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
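
A quick sketch of that call on toy data (the array sizes here are made up for illustration); with test_size=0.33, a third of the samples end up in the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 samples, 2 features
y = np.arange(50)

# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```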

4. What is a decision tree?

Given data with attributes and their classes, a decision tree produces a sequence of rules that can be used to classify the data.

5. What is the syntax for decision tree classifier?

Library: sklearn.tree.DecisionTreeClassifier
Define model: dtc = DecisionTreeClassifier()
Fit model: model = dtc.fit(x, y)
Predictions: pred = model.predict_proba(test)
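
The "sequence of rules" mentioned above can actually be printed with scikit-learn's export_text; the iris data and the max_depth=2 cap are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

dtc = DecisionTreeClassifier(max_depth=2, random_state=0)
model = dtc.fit(X, y)

# human-readable if/else rules learned from the data
rules = export_text(model, feature_names=load_iris().feature_names)
print(rules)
```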

6. What is random forest?

A random forest classifier is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and averages their predictions to improve accuracy and control over-fitting. By default, the sub-sample size is the same as the original input sample size, but the samples are drawn with replacement.

7. What is the syntax for random forest classifier?

Library: sklearn.ensemble.RandomForestClassifier
Define model: rfc = RandomForestClassifier()
Fit model: model = rfc.fit(x, y)
Predictions: pred = model.predict_proba(test)
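
A runnable sketch of the above; n_estimators controls how many bootstrapped trees are averaged (100 is scikit-learn's default, shown explicitly here for illustration).

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 trees, each fit on a bootstrap sample drawn with replacement;
# predict_proba averages the per-tree probabilities
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
model = rfc.fit(X, y)
pred = model.predict_proba(X[:3])
```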

8. What is gradient boosting?

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.

9. What is the syntax for gradient boosting classifier?

Library: sklearn.ensemble.GradientBoostingClassifier
Define model: gbc = GradientBoostingClassifier()
Fit model: model = gbc.fit(x, y)
Predictions: pred = model.predict_proba(test)
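
The stage-wise build-up described in Q8 can be seen with staged_predict, which yields one prediction per boosting iteration; the breast-cancer data and n_estimators=50 are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)

gbc = GradientBoostingClassifier(n_estimators=50, random_state=0)
model = gbc.fit(X, y)

# one accuracy score per stage: each stage adds one more weak tree
stage_scores = [(pred == y).mean() for pred in model.staged_predict(X)]
```

Training accuracy typically climbs as stages accumulate, which is the "stage-wise fashion" in action.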

10. What is SVM?

A support vector machine represents the training data as points in space, separated into categories by a clear gap that is as wide as possible. New examples are then mapped into the same space and predicted to belong to a category based on which side of the gap they fall.
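
The post gives no syntax block for SVM, so here is a minimal sketch assuming the same scikit-learn workflow as the other classifiers, using sklearn.svm.SVC with a linear kernel (both choices are illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

svc = SVC(kernel='linear')  # linear kernel: a maximum-margin hyperplane
model = svc.fit(X, y)
pred = model.predict(X[:5])
```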

11. What is the difference between KNN and KMeans?

KNN:

Supervised classification algorithm
Classifies a new data point by a majority vote among its k closest data points

KMeans:

Unsupervised clustering algorithm

Groups the data into k clusters.
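
The contrast can be sketched on a tiny made-up dataset: KNN needs the labels y, while KMeans ignores them entirely.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels exist -> supervised

# KNN: classifies a new point by vote among its k nearest neighbours
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
knn_pred = knn.predict([[0.15]])

# KMeans: never sees y; just groups the points into k clusters
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```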

12. How do you treat missing values?

Drop rows having missing values

DataFrame.dropna(axis=0, how='any', inplace=True)

Drop columns having missing values

DataFrame.dropna(axis=1, how='any', inplace=True)

Replace missing values with zero / mean

df['income'] = df['income'].fillna(0)
df['income'] = df['income'].fillna(df['income'].mean())
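
A small demonstration on a made-up income column; note that fillna returns a new Series, so the result must be assigned back (or inplace=True used) to persist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [100.0, np.nan, 300.0]})

filled_zero = df['income'].fillna(0)                     # NaN -> 0
filled_mean = df['income'].fillna(df['income'].mean())   # NaN -> mean
dropped = df.dropna(axis=0, how='any')                   # drop NaN rows
```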

13. How do you treat outliers?

The interquartile range (IQR) is commonly used to identify outliers.
Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['income'] >= (Q1 - 1.5 * IQR)) & (df['income'] <= (Q3 + 1.5 * IQR))]
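
Running that filter on a made-up income column with one obvious outlier:

```python
import pandas as pd

df = pd.DataFrame({'income': [10, 12, 11, 13, 12, 11, 500]})

Q1 = df['income'].quantile(0.25)
Q3 = df['income'].quantile(0.75)
IQR = Q3 - Q1

# keep only rows within 1.5 * IQR of the quartiles
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df_clean = df[(df['income'] >= lower) & (df['income'] <= upper)]
```

The 500 falls outside the fences and is dropped; the other six rows survive.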

14. What is bias / variance trade off?

Definition

The bias-variance trade-off is relevant for supervised machine learning, specifically predictive modelling. It is a way to diagnose the performance of an algorithm by breaking down its prediction error.

Error from Bias

Bias is the difference between your model's expected predictions and the true values.
High bias leads to under-fitting.
Bias error does not improve with collecting more data points.

Error from Variance

Variance refers to your algorithm's sensitivity to the specific set of training data.
High variance leads to over-fitting.
Variance error improves with collecting more data points.
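
One way to see the trade-off is to compare a shallow tree (high bias) with an unrestricted tree (high variance) on noisy data; the sine data, noise level, and the errors helper are all made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def errors(depth):
    """Train and test mean squared error for a tree of the given depth."""
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    return mse(y_tr, tree.predict(X_tr)), mse(y_te, tree.predict(X_te))

train_hi_bias, test_hi_bias = errors(1)    # under-fits: both errors high
train_hi_var, test_hi_var = errors(None)   # over-fits: train error near 0
```

The deep tree memorises the training noise (train error near zero, test error noticeably higher), while the depth-1 stump has high error everywhere.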

15. How do you treat categorical variables?

Replace each category with the average of the target for that category (target mean encoding)

Alternatively, apply one-hot encoding to turn each category into its own binary column..!!
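
Both treatments in a small pandas sketch; the city column and target values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'city': ['A', 'B', 'A', 'C'],
                   'target': [1, 0, 0, 1]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df['city'], prefix='city')

# Target (mean) encoding: replace each category with its mean target
df['city_mean'] = df.groupby('city')['target'].transform('mean')
```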
