
Why Is the Central Limit Theorem Important for Every Data Scientist?

The Central Limit Theorem is at the core of what every data scientist does daily: make statistical inferences about data.

The theorem lets us quantify how likely it is that a sample statistic deviates from the population value, without having to draw a new sample to compare it with. We don't need to know the characteristics of the whole population to judge how representative our sample is likely to be.

Confidence intervals and hypothesis testing are both based on the CLT. Because the sample mean approximately follows a normal distribution centered on the population mean, we know that about 68 percent of sample means lie within one standard error of the population mean, about 95 percent lie within two standard errors, and so on. (The standard error is the population standard deviation divided by the square root of the sample size.) In other words, the theorem simplifies problems in statistics by allowing us to work with a distribution that is approximately normal, whatever the shape of the population distribution.
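The 68 and 95 percent figures can be checked with a small simulation. The sketch below is only an illustration: it assumes an exponential population (chosen because it is clearly not normal), and the sample size, number of trials and seed are arbitrary choices. It draws many samples, computes each sample mean, and counts how many fall within one and two standard errors of the population mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population: Exponential(rate=1), which is strongly right-skewed.
POP_MEAN = 1.0
POP_SD = 1.0

n = 50         # observations per sample (arbitrary choice)
trials = 5000  # number of repeated samples (arbitrary choice)

# Standard error of the mean: population SD divided by sqrt(n).
standard_error = POP_SD / n ** 0.5

# Draw many samples and keep the mean of each one.
sample_means = [
    statistics.fmean([random.expovariate(1.0) for _ in range(n)])
    for _ in range(trials)
]

# Share of sample means within 1 and 2 standard errors of the population mean.
within_1 = sum(abs(m - POP_MEAN) <= standard_error for m in sample_means) / trials
within_2 = sum(abs(m - POP_MEAN) <= 2 * standard_error for m in sample_means) / trials

print(f"within 1 standard error:  {within_1:.3f}")
print(f"within 2 standard errors: {within_2:.3f}")
```

The two printed shares should come out near 0.68 and 0.95, even though the individual observations are far from normally distributed.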




The CLT is not limited to making inferences from a sample about a population. There are four kinds of inferences we can make based on the CLT:

1. We have information about a valid sample. We can make accurate inferences about its population.
2. We have information about the population. We can make accurate inferences about a valid sample from that population.
3. We have information about a population and a valid sample. We can accurately infer whether the sample was drawn from that population.
4. We have information about two different valid samples. We can accurately infer whether the two samples were drawn from the same population.
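Two of these inferences can be sketched in a few lines of Python. This is a minimal example under the normal approximation, not a full statistical workflow. The function names `mean_confidence_interval` and `two_sample_z_test` are made up for this illustration, and the data are randomly generated. The first function covers case 1 (estimating a population mean from a sample), and the second covers case 4 (comparing two samples).

```python
import math
import random
import statistics
from statistics import NormalDist


def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean (case 1)."""
    standard_error = statistics.stdev(sample) / math.sqrt(len(sample))
    # z value that leaves (1 - confidence) / 2 in each tail, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    mean = statistics.fmean(sample)
    return mean - z * standard_error, mean + z * standard_error


def two_sample_z_test(a, b):
    """z statistic and two-sided p-value for a difference in means (case 4)."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    z = (statistics.fmean(a) - statistics.fmean(b)) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


random.seed(1)
# Made-up data for illustration only: two samples with different true means.
sample_a = [random.gauss(100, 15) for _ in range(200)]
sample_b = [random.gauss(115, 15) for _ in range(200)]

low, high = mean_confidence_interval(sample_a)
z, p_value = two_sample_z_test(sample_a, sample_b)

print(f"95% CI for the mean of sample_a: ({low:.2f}, {high:.2f})")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the two samples were unlikely to come from populations with the same mean. Both functions rely on the samples being large enough for the normal approximation to be reasonable.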

Conditions for the Central Limit Theorem:

Independence
>> The sampled observations must be independent.
>> Random sampling should be used.
>> If sampling without replacement, the sample should be less than 10% of the population.
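One way to see where the 10% guideline comes from is the finite population correction, the factor by which the standard error shrinks when sampling without replacement. The sketch below uses a hypothetical population of 1,000. The function name is made up for this example.

```python
import math


def finite_population_correction(population_size, sample_size):
    """Factor that scales the standard error when sampling without replacement."""
    return math.sqrt((population_size - sample_size) / (population_size - 1))


N = 1000  # hypothetical population size
for n in (10, 50, 100, 500):
    fpc = finite_population_correction(N, n)
    print(f"n = {n:>3} ({n / N:.0%} of population): correction factor = {fpc:.3f}")
```

While the sample is a small share of the population, the factor stays close to 1, so the draws behave almost as if they were independent. Around 10% it is still roughly 0.95, and it drops off more sharply as the sample becomes a large share of the population.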

Sample size and skew
>> If the sample is small, the population distribution should be approximately normal.
>> If the population distribution is skewed, the sample must be large (a common rule of thumb is 30 or more observations).
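The effect of sample size on a skewed population can be illustrated with a simulation. The sketch below again assumes an exponential population, with arbitrary sample sizes and seed, and measures the skewness of the sample means for each sample size. A skewness near zero indicates a roughly symmetric, more normal-looking distribution.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable


def skewness(values):
    """Sample skewness: the average cubed z-score (0 for a symmetric shape)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


trials = 4000   # number of repeated samples per sample size
skew_by_n = {}  # sample size -> skewness of the sample means

for n in (2, 30, 300):
    means = [
        statistics.fmean([random.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]
    skew_by_n[n] = skewness(means)
    print(f"n = {n:>3}: skewness of sample means = {skew_by_n[n]:.2f}")
```

The skewness should shrink toward zero as n grows, which is why a larger sample is needed when the population itself is skewed.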

Important points to remember:

The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger.

Sample sizes of 30 or more are often considered sufficient for the CLT to hold, although heavily skewed populations may need larger samples.

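A key aspect of the CLT is that the average of the sample means equals the population mean, while the standard deviation of the sample means (the standard error) equals the population standard deviation divided by the square root of the sample size.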
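A quick simulation can check how the sample means behave. Their average should land close to the population mean, and their spread should be close to the population standard deviation divided by the square root of n, not the population standard deviation itself. The sketch assumes a uniform population on [0, 12], with arbitrary sample size, trial count and seed.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

# Assumed population: Uniform(0, 12).
LOW, HIGH = 0.0, 12.0
POP_MEAN = (LOW + HIGH) / 2            # 6.0
POP_SD = (HIGH - LOW) / math.sqrt(12)  # about 3.46

n = 36         # observations per sample
trials = 5000  # number of repeated samples

sample_means = [
    statistics.fmean([random.uniform(LOW, HIGH) for _ in range(n)])
    for _ in range(trials)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = POP_SD / math.sqrt(n)

print(f"population mean = {POP_MEAN:.3f}, mean of sample means = {mean_of_means:.3f}")
print(f"population SD   = {POP_SD:.3f}, SD of sample means   = {sd_of_means:.3f}")
print(f"population SD / sqrt(n) = {expected_se:.3f}")
```

With n = 36, the spread of the sample means should be roughly one sixth of the population standard deviation.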

A sufficiently large random sample can estimate the characteristics of a population accurately.
