20 Must-Know Data Science Interview Questions by KDnuggets

The most important questions that technical panels generally ask:

1. Explain what regularization is and why it is useful.
2. Which data scientists do you admire most? Which startups?
3. How would you validate a predictive model of a quantitative outcome variable that you built using multiple regression?
4. Explain what precision and recall are. How do they relate to the ROC curve?
5. How can you prove that one improvement you've brought to an algorithm is really an improvement over not doing anything?
6. What is root cause analysis?
7. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
8. What is statistical power?
9. Explain what resampling methods are and why they are useful. Also explain their limitations.
10. Is it better to have too many false positives, or too many false negatives? Explain.
11. What is selection bias, why is it important and how can you avoid it?
12. Give an example of how you would use experimental design to answer a question about user behavior.
13. What is the difference between "long" and "wide" format data?
14. What method do you use to determine whether the statistics published in an article (e.g. newspaper) are either wrong or presented to support the author's point of view, rather than correct, comprehensive factual information on a specific subject?
15. Explain Edward Tufte's concept of "chart junk."
16. How would you screen for outliers and what should you do if you find one?
17. How would you use extreme value theory, Monte Carlo simulations, or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
18. What is a recommendation engine? How does it work?
19. Explain what a false positive and a false negative are. Why is it important to differentiate these from each other?
20. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How can you efficiently represent 5 dimensions in a chart (or in a video)?
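
Questions 4, 10 and 19 all rest on the same four counts: true positives, false positives, false negatives and true negatives. As a warm-up before reading the full answers, here is a minimal Python sketch of how precision and recall fall out of those counts. The function names and the toy labels are made up for illustration and are not taken from the KDnuggets answers.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels, where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positive
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored
    return tp, fp, fn, tn


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 4 actual positives, 4 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

print(confusion_counts(y_true, y_pred))
print(precision_recall(y_true, y_pred))
```

Whether a false positive or a false negative is the costlier mistake depends on the application, which is why the two counts are kept separate rather than merged into a single error rate.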

Answers from KDnuggets: https://www.kdnuggets.com/2016/02/21-data-science-interview-questions-answers.html

Happy Learning...!!
