Skip to main content

Random Forest Algorithm

Random Forest is an ensemble machine learning algorithm that follows the bagging technique. The base estimators in the random forest are decision trees. Random forest randomly selects a set of features that are used to decide the best split at each node of the decision tree.

Looking at it step-by-step, this is what a random forest model does:

1. Random subsets are created from the original dataset (bootstrapping).

2. At each node in the decision tree, only a random set of features are considered to decide the best split.

3. A decision tree model is fitted on each of the subsets.

4. The final prediction is calculated by averaging the predictions from all decision trees.

To sum up, the Random forest randomly selects data points and features and builds multiple trees (Forest).

Random Forest is used for feature importance selection. The attribute (.feature_importances_) is used to find feature importance.

Some Important Parameters:-

1. n_estimators:- It defines the number of decision trees to be created in a random forest.

2. criterion:- "Gini" or "Entropy."

3. min_samples_split:- Used to define the minimum number of samples required in a leaf node before a split is attempted

4. max_features: -It defines the maximum number of features allowed for the split in each decision tree.

5. n_jobs:- The number of jobs to run in parallel for both fit and predict. Always keep (-1) to use all the cores for parallel processing.

Comments

Popular posts from this blog

Data is the New oil of Industry?

Let's go back to 18th century ,when development was taking its first footstep.The time when oil was considered to be the subset of industrial revolution. Oil than tends to be the most valuable asset in those time. Now let's come back in present. In 21st century, data is vigorously called the foundation of information revolution. But the question that arises is why are we really calling data as the new oil. Well for it's explanation Now we are going to compare Data Vs Oil Data is an essential resource that powers the information economy in much the way that oil has fueled the industrial economy. Once upon a time, the wealthiest were those with most natural resources, now it’s knowledge economy, where the more you know is proportional to more data that you have. Information can be extracted from data just as energy can be extracted from oil. Traditional Oil powered the transportation era, in the same way that Data as the new oil is also powering the emerging transportation op...

Important Python Libraries for Data Science

Python is the most widely used programming language today. When it comes to solving data science tasks and challenges, Python never ceases to surprise its users. Most data scientists are already leveraging the power of Python programming every day. Python is an easy-to-learn, easy-to-debug, widely used, object-oriented, open-source, high-performance language, and there are many more benefits to Python programming.People in Data Science definitely know about the Python libraries that can be used in Data Science but when asked in an interview to name them or state its function, we often fumble up or probably not remember more than 5 libraries. Important Python Libraries for Data Science: Pandas NumPy SciPy Matplotlib TensorFlow Seaborn Scikit Learn Keras 1. Pandas Pandas (Python data analysis) is a must in the data science life cycle. It is the most popular and widely used Python library for data science, along with NumPy in matplotlib. With around 17,00 comments on GitH...

Differentiate between univariate, bivariate and multivariate analysis.

Univariate analysis are descriptive statistical analysis techniques which can be differentiated based on one variable involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can the analysis can be referred to as univariate analysis. The bivariate analysis attempts to understand the difference between two variables at a time as in a scatterplot. For example, analyzing the volume of sale and spending can be considered as an example of bivariate analysis. Multivariate analysis deals with the study of more than two variables to understand the effect of variables on the responses.