
Data Science Basic Interview Questions - Part 1

Q1. What is Data Science? List the differences between supervised and unsupervised learning.

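Data Science is a blend of various tools, algorithms, and machine learning principles with the goal of discovering hidden patterns in raw data. How is this different from what statisticians have been doing for years?
The answer lies in the difference between explaining and predicting. Traditional analysis mostly explains what has already happened in the data, whereas data science also uses machine learning to predict what is likely to happen next.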
The differences between supervised and unsupervised learning are as follows:

| Supervised Learning | Unsupervised Learning |
| --- | --- |
| Input data is labelled. | Input data is unlabelled. |
| Uses a training data set. | Uses the input data set. |
| Used for prediction. | Used for analysis. |
| Enables classification and regression. | Enables clustering, density estimation, and dimension reduction. |
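To make the contrast concrete, here is a minimal sketch in plain Python. The toy data, the 1-nearest-neighbour classifier, and the two-cluster k-means routine are illustrative assumptions chosen for brevity; they are not part of the original answer. The supervised function needs the labels, while the unsupervised one sees only the raw values.

```python
# Toy 1-D data: the values and labels are made up for illustration.
labelled = [(1.0, "small"), (1.2, "small"), (0.8, "small"),
            (5.0, "large"), (5.3, "large"), (4.9, "large")]


def predict_1nn(x, training):
    """Supervised: return the label of the closest labelled training point."""
    _, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def two_means(values, iterations=10):
    """Unsupervised: split unlabelled values into two clusters (1-D k-means, k=2)."""
    centres = [min(values), max(values)]  # simple deterministic initialisation
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearer centre.
            idx = 0 if abs(v - centres[0]) <= abs(v - centres[1]) else 1
            clusters[idx].append(v)
        # Move each centre to the mean of its cluster (keep it if the cluster is empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters


# Supervised: labels are used to predict the class of a new value.
print(predict_1nn(1.1, labelled))

# Unsupervised: labels are discarded and structure is found from the values alone.
unlabelled = [value for value, _ in labelled]
print(two_means(unlabelled))
```

In practice a library implementation would normally be used; the point here is only that the first function consumes labels and the second does not.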

Q2. What is Selection Bias?

Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is usually associated with research where the selection of participants isn’t random. It is sometimes referred to as the selection effect. It is the distortion of statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate.
The types of selection bias include:
  1. Sampling bias: A systematic error due to a non-random sample of a population, which makes some members of the population less likely to be included than others and results in a biased sample.
  2. Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
  3. Data: Specific subsets of data are chosen to support a conclusion, or data is rejected as bad on arbitrary grounds, instead of according to previously stated or generally agreed criteria.
  4. Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants), that is, discounting trial subjects or tests that did not run to completion.
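The effect of sampling bias can be illustrated with a short simulation. This is only a sketch: the population (10,000 normally distributed values with mean 50) and the selection rule (only values above 55 can be picked) are hypothetical choices made for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 values drawn from a normal distribution.
population = [random.gauss(50, 10) for _ in range(10_000)]
population_mean = sum(population) / len(population)

# Random sample: every member is equally likely to be chosen.
random_sample = random.sample(population, 500)
random_mean = sum(random_sample) / len(random_sample)

# Biased sample: only members with a value above 55 can be selected,
# so part of the population has no chance of being included.
biased_pool = [x for x in population if x > 55]
biased_sample = random.sample(biased_pool, 500)
biased_mean = sum(biased_sample) / len(biased_sample)

print(f"population mean:    {population_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

The random sample should land close to the population mean, while the biased sample is necessarily above 55, so any conclusion drawn from it about the whole population would be distorted.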

Q3. What is bias-variance trade-off?

Bias: Bias is an error introduced in your model due to oversimplification of the machine learning algorithm. It can lead to underfitting. When you train your model, it makes simplifying assumptions so that the target function is easier to learn.
Low bias machine learning algorithms: Decision Trees, k-NN and SVM
High bias machine learning algorithms: Linear Regression, Logistic Regression
Variance: Variance is an error introduced in your model due to an overly complex machine learning algorithm. The model also learns noise from the training data set and then performs badly on the test data set. It can lead to high sensitivity and overfitting.
Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.
Bias-Variance trade-off: The goal of any supervised machine learning algorithm is to have low bias and low variance to achieve good prediction performance.
There is no escaping the relationship between bias and variance in machine learning. Increasing the bias will decrease the variance. Increasing the variance will decrease bias.
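A small simulation can illustrate the trade-off. The sketch below is built on assumptions that are not in the original answer: a sine target function, Gaussian noise, a constant predictor standing in for a high-bias model, and a 1-nearest-neighbour predictor standing in for a low-bias, high-variance model. Each model is retrained on many freshly drawn training sets, and its predictions at one test point are used to estimate squared bias and variance.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible


def true_function(x):
    """Assumed target function for the illustration."""
    return math.sin(2 * math.pi * x)


def make_training_set(n=30, noise_sd=0.3):
    """Draw n noisy observations of the target on [0, 1)."""
    xs = [random.random() for _ in range(n)]
    return [(x, true_function(x) + random.gauss(0, noise_sd)) for x in xs]


def constant_model(training, x):
    """High-bias model: ignores x and predicts the mean training target."""
    return statistics.fmean([y for _, y in training])


def one_nn_model(training, x):
    """Low-bias, high-variance model: copies the target of the nearest point."""
    return min(training, key=lambda pair: abs(pair[0] - x))[1]


def bias_sq_and_variance(model, x0, repeats=500):
    """Estimate squared bias and variance of a model's prediction at x0."""
    predictions = [model(make_training_set(), x0) for _ in range(repeats)]
    mean_prediction = statistics.fmean(predictions)
    bias_sq = (mean_prediction - true_function(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_sq, variance


x0 = 0.25  # test point where the target equals 1
const_bias_sq, const_var = bias_sq_and_variance(constant_model, x0)
knn_bias_sq, knn_var = bias_sq_and_variance(one_nn_model, x0)

print(f"constant model: bias^2={const_bias_sq:.3f}, variance={const_var:.3f}")
print(f"1-NN model:     bias^2={knn_bias_sq:.3f}, variance={knn_var:.3f}")
```

Under these assumptions the constant model shows a large squared bias and a small variance, while the 1-NN model shows the opposite pattern, which is the relationship described above.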

Q4. What is a confusion matrix?

The confusion matrix is a 2×2 table that contains the 4 outcomes produced by a binary classifier. Various measures, such as error rate, accuracy, specificity, sensitivity, precision and recall, are derived from it.
A data set used for performance evaluation is called a test data set. It should contain both the correct (observed) labels and the predicted labels.
The predicted labels will be exactly the same as the observed labels if the performance of a binary classifier is perfect.
In real-world scenarios, the predicted labels usually match only part of the observed labels.
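A binary classifier predicts all data instances of a test data set as either positive or negative. This produces four outcomes:
  1. True positive (TP): Correct positive prediction.
  2. False positive (FP): Incorrect positive prediction.
  3. True negative (TN): Correct negative prediction.
  4. False negative (FN): Incorrect negative prediction.
Basic measures derived from the confusion matrix:
  1. Error rate = (FP + FN) / (TP + TN + FP + FN)
  2. Accuracy = (TP + TN) / (TP + TN + FP + FN)
  3. Sensitivity (recall, or true positive rate) = TP / (TP + FN)
  4. Specificity (true negative rate) = TN / (TN + FP)
  5. Precision = TP / (TP + FP)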
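As a rough illustration of how these measures are computed, the following sketch counts the four outcomes from a pair of label lists and derives the basic measures from them. The function names and the toy labels are assumptions made for this example, and the code assumes that every denominator is non-zero.

```python
def confusion_counts(observed, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == positive and obs == positive:
            tp += 1  # correct positive prediction
        elif pred == positive and obs != positive:
            fp += 1  # incorrect positive prediction
        elif pred != positive and obs != positive:
            tn += 1  # correct negative prediction
        else:
            fn += 1  # incorrect negative prediction
    return tp, fp, tn, fn


def basic_measures(tp, fp, tn, fn):
    """Derive the basic measures; assumes no denominator is zero."""
    total = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


# Made-up test data set: 1 = positive, 0 = negative.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(observed, predicted)
print(counts)  # (3, 1, 5, 1) as (TP, FP, TN, FN)
print(basic_measures(*counts))
```

For this toy data the classifier gets 8 of 10 instances right, so accuracy is 0.8 and the error rate is 0.2, while sensitivity and precision are both 3/4.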
Reference: Thanks to Edureka for an awesome collection.
