# ML Series 4: Decision Tree 🌲

Classification and Regression Trees

Tree-based algorithms form a widely used category of supervised learning. In data competitions like Kaggle, tree-based methods such as LightGBM and Random Forest are usually the first go-to option due to their strong performance and human-understandable explanations. Therefore, I think it's important to have a separate series to discuss this family of algorithms.🔥

# Introduction

Tree-based models split the data multiple times according to certain cutoff values in the features. Through splitting, different subsets of the dataset are created, with each instance belonging to exactly one subset. The final subsets are called terminal or leaf nodes, and the intermediate subsets are called internal or split nodes. To predict the outcome at a leaf node, we use the average outcome of the training data in that node (for regression) or the majority class (for classification). Decision trees can capture nonlinear relationships between features and the target.

# Classification Learning Algorithm

We can see that tree generation is a recursive process. In the basic algorithm, there are three cases in which the recursion returns:

1. If all samples in the node belong to the same class, there is no need to split.
2. If the feature set is empty, or all samples have identical values on every remaining feature, the node cannot be split further; label it with the majority class.
3. If the current node contains no samples, it cannot be split; label it with the majority class of its parent.
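The three stopping conditions above can be sketched as a short recursive routine. This is only an illustration under my own naming conventions (`build_tree`, `choose_best_feature` are hypothetical, not from any particular library), with split selection stubbed out:

```python
from collections import Counter

def choose_best_feature(samples, labels, features):
    # Placeholder: a real implementation scores each candidate by
    # information gain (ID3), gain ratio (C4.5), or Gini index (CART).
    # Here we just take the first feature, for illustration only.
    return features[0]

def build_tree(samples, labels, features):
    """Grow a tree recursively; `samples` is a list of dicts mapping
    feature name -> value, `labels` holds each sample's class."""
    # Case 1: all samples share one class -> pure leaf, stop.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Case 2: no features left, or every sample looks identical on the
    # remaining features -> leaf labelled with the majority class.
    if not features or all(
        s[f] == samples[0][f] for s in samples for f in features
    ):
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best = choose_best_feature(samples, labels, features)
    node = {"feature": best, "children": {}}
    rest = [f for f in features if f != best]
    # Case 3 (an empty child) cannot occur here because we branch only
    # on values actually observed; textbook versions fall back to the
    # parent's majority class when a branch receives no samples.
    for value in {s[best] for s in samples}:
        sub = [(s, y) for s, y in zip(samples, labels) if s[best] == value]
        node["children"][value] = build_tree(
            [s for s, _ in sub], [y for _, y in sub], rest
        )
    return node
```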

## Split Selection

From the algorithm above, the crucial step is how to pick the best feature to split a node on. Generally, as splitting proceeds, we want each node to contain samples from as few classes as possible, i.e., to have higher "purity".

## 🌲ID3 Algorithm

Information entropy is a commonly used metric for node purity. The higher the value, the messier the data; when a node contains only one class, Ent(D) = 0.

Here p_k is the proportion of the k-th class in dataset D.
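Written out, with p_k as defined above, the standard definition is:

```latex
\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k
```

Ent(D) reaches 0 when the node is pure and its maximum, log2 |Y|, when all classes are equally represented.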

**Information Gain**

When splitting a node, we choose the feature with the highest information gain.
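The information gain of feature a is Gain(D, a) = Ent(D) − Σ_v (|D^v| / |D|) · Ent(D^v), where D^v is the subset of D taking value v on a. A minimal sketch (function names are mine, not from a specific library):

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), with p_k the share of class k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v),
    where D^v is the subset of samples with feature value v."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    weighted = sum(len(ys) / n * entropy(ys) for ys in subsets.values())
    return entropy(labels) - weighted
```

For example, a feature that separates the classes perfectly recovers the full entropy of the node as gain.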

## 🌲C4.5 Algorithm

The ID3 algorithm tends to prefer features with many distinct values. For example, with a feature like an ID number, dataset D splits into |D| branches. Since every branch contains only one sample, its information entropy is 0: perfectly "pure", yet useless for classification. Therefore, C4.5 uses the gain ratio instead.
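The gain ratio normalizes information gain by the feature's intrinsic value IV(a), which grows with the number of branches:

```latex
\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)},
\qquad
\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}
```

Since the gain ratio in turn tends to prefer features with few values, C4.5 uses a heuristic: first keep the features whose information gain is above average, then pick the one with the highest gain ratio among them.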

## 🌲CART Algorithm

Another impurity measure decision trees often use is the Gini index. Gini(D) is the probability that two samples drawn at random from dataset D belong to different classes.

Therefore, among the candidate feature set A, we choose the feature with the lowest Gini index to split the node: a* = argmin_{a ∈ A} Gini_index(D, a).
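Under the definition above, Gini(D) = 1 − Σ_k p_k², and the Gini index of feature a is the size-weighted average Gini of its branches. A minimal sketch (names are mine):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: the chance that two samples drawn at
    random (with replacement) from D carry different classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v)."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return sum(len(ys) / n * gini(ys) for ys in subsets.values())
```

A pure node has Gini 0; a balanced two-class node has Gini 0.5, the worst case for two classes.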

# Regression Learning Algorithm

For continuous features, it is not efficient to treat each distinct value as a branch, so we need to discretize the continuous variable. The most common way is binary partitioning on a continuous feature a: find a split point s that divides dataset D into the subsets {a ≤ s} and {a > s}.

Put into words: we try every candidate split point s, and pick the one that minimizes the total squared error of the two resulting subsets, where each subset is predicted by its own mean.
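A minimal sketch of this binary partition for a single continuous feature, assuming squared error as the impurity and midpoints between consecutive values as candidate thresholds (function names are mine, for illustration only):

```python
def best_split(x, y):
    """Find the threshold s on continuous feature x that minimizes the
    total squared error when each side {x <= s} / {x > s} is predicted
    by its own mean (the CART regression criterion)."""
    def sse(ys):
        # Sum of squared errors around the subset mean.
        if not ys:
            return 0.0
        m = sum(ys) / len(ys)
        return sum((v - m) ** 2 for v in ys)

    pairs = sorted(zip(x, y))
    best_s, best_err = None, float("inf")
    # Candidate thresholds: midpoints between consecutive distinct x values.
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue
        s = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [v for u, v in pairs if u <= s]
        right = [v for u, v in pairs if u > s]
        err = sse(left) + sse(right)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err
```

A real regression tree repeats this search over every feature at every node and splits on the (feature, s) pair with the lowest error.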

# Interview Questions

- What is Information Gain? What are its disadvantages?
- How does a decision tree handle missing attribute values?
- Do we require feature scaling for decision trees?
- If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?