
Why Is the Central Limit Theorem Important for Every Data Scientist?

The Central Limit Theorem is at the core of what every data scientist does daily: make statistical inferences about data.

The theorem lets us quantify how much a sample is likely to deviate from the population without drawing any new sample to compare it with. We don't need to know the characteristics of the whole population to judge how likely our sample is to be representative of it.

The concepts of confidence intervals and hypothesis testing are built on the CLT. Because the sample mean follows an approximately normal distribution, about 68 percent of sample means lie within one standard error of the population mean, about 95 percent lie within two standard errors, and so on. In other words, the theorem simplifies many problems in statistics by letting us work with a distribution that is approximately normal, regardless of the shape of the population itself.
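The 68/95 rule for sample means is easy to check by simulation. The sketch below is illustrative: the exponential population, sample size of 50, and trial count are all arbitrary choices. It draws many samples from a skewed population and counts how often the sample mean lands within one and two standard errors of the population mean:

```python
import numpy as np

rng = np.random.default_rng(42)

# Population: exponential with mean = 1 and sd = 1 (clearly non-normal)
pop_mean, pop_sd = 1.0, 1.0
n = 50            # sample size (illustrative choice)
trials = 20_000   # number of repeated samples

# Draw many samples and record each sample's mean
sample_means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# Standard error predicted by the CLT: sigma / sqrt(n)
se = pop_sd / np.sqrt(n)

within_1se = np.mean(np.abs(sample_means - pop_mean) < 1 * se)
within_2se = np.mean(np.abs(sample_means - pop_mean) < 2 * se)

print(f"within 1 SE: {within_1se:.3f}")   # close to 0.68
print(f"within 2 SE: {within_2se:.3f}")   # close to 0.95
```

Even though the population is strongly right-skewed, the coverage fractions come out near the normal-distribution values of 0.68 and 0.95.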

The CLT is not limited to making inferences from a single sample about a population. There are four kinds of inferences we can make based on it:

1. We have the information of a valid sample. We can make accurate statements about its population.
2. We have the information of the population. We can make accurate statements about a valid sample from that population.
3. We have the information of a population and a valid sample. We can accurately infer whether the sample was drawn from that population.
4. We have the information of two different valid samples. We can accurately infer whether the two samples were drawn from the same population.
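Inference type 4 can be sketched with a simple two-sample z-test, which works precisely because the CLT makes each sample mean approximately normal. The populations, sample sizes, and seed below are illustrative assumptions, not part of any standard dataset:

```python
import math
import numpy as np

def two_sample_z_test(a, b):
    """Two-sided z-test for equal means. Valid for large samples
    because the CLT makes both sample means approximately normal."""
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = (a.mean() - b.mean()) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p

rng = np.random.default_rng(0)
sample_a = rng.normal(100, 15, 200)
sample_b = rng.normal(110, 15, 200)  # population shifted by 10

z, p = two_sample_z_test(sample_a, sample_b)
print(f"z = {z:.2f}, p = {p:.2e}")  # tiny p: reject "same population"
```

A tiny p-value is strong evidence the two samples were not drawn from the same population; a large p-value means we found no such evidence.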

Conditions for the Central Limit Theorem:

Independence
>> The sampled observations must be independent.
>> Random sampling must be used.
>> If sampling without replacement, the sample should be less than 10% of the population.

Sample skew
>> If the population distribution is normal, even small samples work.
>> But if the distribution is skewed, the sample must be large (greater than 30).
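The "greater than 30" guideline can be checked empirically: the skewness of the sampling distribution of the mean shrinks roughly as 1/√n. A minimal simulation, using an exponential population as an illustrative skewed case:

```python
import numpy as np

rng = np.random.default_rng(1)

def mean_skewness(n, trials=20_000):
    """Estimate the skewness of the distribution of sample means
    for an exponential (right-skewed) population at sample size n."""
    means = rng.exponential(size=(trials, n)).mean(axis=1)
    centered = means - means.mean()
    return np.mean(centered**3) / np.std(means)**3

for n in (2, 10, 30, 100):
    print(f"n={n:>3}: skewness of sample means = {mean_skewness(n):+.2f}")
```

The exponential population has skewness 2, yet by n = 30 the sample means are already close to symmetric, which is what the rule of thumb relies on.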

Important points to remember:

The central limit theorem (CLT) states that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution.

Sample sizes equal to or greater than 30 are considered sufficient for the CLT to hold.

A key aspect of the CLT is that the average of the sample means equals the population mean, while the standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size (σ/√n, known as the standard error).

With a sufficiently large sample, we can estimate the characteristics of the population accurately.
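The relationship between the population parameters and the sampling distribution is easy to verify numerically: the mean of many sample means approaches the population mean, and their standard deviation approaches σ/√n. A short sketch (the uniform population and n = 40 are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(7)

# Uniform population on [0, 10]: mean = 5, sd = 10/sqrt(12) ≈ 2.887
pop_mean = 5.0
pop_sd = 10 / np.sqrt(12)
n = 40
trials = 50_000

# Distribution of the sample mean, estimated by repeated sampling
sample_means = rng.uniform(0, 10, size=(trials, n)).mean(axis=1)

print(f"mean of sample means: {sample_means.mean():.3f}  "
      f"(population mean: {pop_mean})")
print(f"sd of sample means:   {sample_means.std():.3f}  "
      f"(sigma/sqrt(n):   {pop_sd / np.sqrt(n):.3f})")
```

Both estimates land within a few thousandths of the CLT's predictions.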
