Random Forest Algorithm
The Random Forest is a powerful ensemble learning method used for both classification and regression. It operates by constructing a multitude of decision trees at training time.
Core Concept
A Random Forest is a collection of Decision Trees working together to improve prediction accuracy. By combining many models, it reduces the risk of "overfitting" (where a model learns noise instead of patterns) that often affects individual decision trees.
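To see this in practice, here is a minimal sketch (assuming scikit-learn is available; the dataset and parameters are purely illustrative) that compares a single tree against a forest on the same synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, split into train and held-out test portions.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# The single tree typically overfits; the forest usually generalizes better.
print("Single tree accuracy:", tree.score(X_test, y_test))
print("Random forest accuracy:", forest.score(X_test, y_test))
```

On most runs the forest scores noticeably higher on the held-out set, because averaging many diverse trees cancels out much of their individual noise.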
How It Works
1. Bootstrapping (Data Sampling)
The algorithm creates new versions of the original dataset through a process called Bootstrapping:
- Row Sampling: New datasets are created by randomly picking rows from the original data with replacement (meaning the same row can appear multiple times in one set).
- Feature Sampling: At each split inside a tree, only a random subset of columns (features) is considered as candidates, which keeps the trees from all relying on the same dominant features (see the sketch after this list).
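Here is a minimal NumPy sketch of row sampling (the array, seed, and sizes are illustrative). Feature sampling appears in the next step, since in practice it happens while each tree is being built:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(6, 4))  # toy dataset: 6 rows, 4 features

# Draw 6 row indices *with replacement*: duplicates are expected,
# and some original rows will be left out entirely.
row_idx = rng.integers(0, len(X), size=len(X))
X_boot = X[row_idx]  # bootstrapped dataset, same shape as X

print(row_idx)  # e.g. repeated indices show the "with replacement" behavior
```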
2. Building Decision Trees
A decision tree is trained independently for each bootstrapped dataset. Because each tree sees different data and different features, the "Forest" becomes a diverse group of "experts".
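A simplified version of that training loop might look like this sketch (assuming scikit-learn's DecisionTreeClassifier as the base learner; the tree count and dataset are illustrative). Setting max_features="sqrt" makes each split consider a random feature subset, covering the feature sampling from step 1:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)
rng = np.random.default_rng(seed=0)

forest = []
for _ in range(25):  # 25 trees, an arbitrary choice for the sketch
    idx = rng.integers(0, len(X), size=len(X))          # bootstrap the rows
    tree = DecisionTreeClassifier(max_features="sqrt")  # random features per split
    forest.append(tree.fit(X[idx], y[idx]))             # each tree sees different data
```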
3. Aggregation (The Vote)
To make a final prediction for a new data point (illustrated in the sketch after this list):
- Every tree in the forest runs its own prediction.
- Classification: The final result is decided by majority vote.
- Regression: The final result is the average of all tree outputs.
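A small sketch of both aggregation rules, using hypothetical tree outputs (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical class predictions from five trees for three data points.
class_preds = np.array([[0, 1, 1],
                        [0, 1, 0],
                        [1, 1, 1],
                        [0, 0, 1],
                        [0, 1, 1]])

# Classification: majority vote per column (one column per data point).
majority = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority)  # -> [0 1 1]

# Regression: average the raw tree outputs instead of voting.
reg_preds = np.array([[2.1, 3.0], [1.9, 3.4], [2.0, 3.2]])
print(reg_preds.mean(axis=0))  # -> [2.0 3.2]
```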
Technical Jargon: Bagging
The term Bagging is short for Bootstrap Aggregating: train each model on a bootstrap sample, then aggregate their predictions.
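Bagging is more general than Random Forests: it works with any base model. As a sketch, scikit-learn's BaggingClassifier wraps bootstrap sampling and aggregation around a chosen estimator; with decision trees it behaves like a Random Forest minus the per-split feature sampling (the parameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 50 trees, each trained on a bootstrap sample, aggregated by voting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))
```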
Pro Tip: Square Root Rule
When selecting random feature candidates for classification, a common default is to consider \(\sqrt{n}\) features at each split, where \(n\) is the total number of features.
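As a quick illustration (the feature count is arbitrary; the parameter shown is how the rule is usually spelled in scikit-learn):

```python
import math
from sklearn.ensemble import RandomForestClassifier

n_features = 16
print(math.isqrt(n_features))  # -> 4 candidate features per split

# scikit-learn expresses the rule as max_features="sqrt", which is the
# classification default in recent versions.
clf = RandomForestClassifier(max_features="sqrt")
```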
