ML Series4: Decision Tree 🌲
Classification and Regression Trees
Tree-based algorithms are a widely used category of supervised learning. In data competitions like Kaggle, tree-based methods such as LightGBM and Random Forest are often the first go-to option thanks to their strong performance and human-understandable explanations. Therefore, I think it’s important to have a separate series to discuss this family of algorithms.🔥
Tree-based models split the data multiple times according to certain cutoff values of the features. Through splitting, different subsets of the dataset are created, with each instance belonging to exactly one subset. The final subsets are called terminal or leaf nodes, and the intermediate subsets are called internal or split nodes. To predict the outcome in a leaf node, the average outcome of the training data in that node is used (for classification, the majority class). Decision trees can capture nonlinear relationships between features and the target.
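To make the leaf-node prediction concrete, here is a toy sketch (the data and cutoff are made up for illustration): a one-split “stump” routes an instance to one side of the cutoff and returns that side’s mean training outcome.

```python
def stump_predict(x, cutoff, left_mean, right_mean):
    """Route x to a leaf by the cutoff and return that leaf's mean target."""
    return left_mean if x <= cutoff else right_mean

# training targets in each subset after splitting at cutoff = 2.0 (toy data)
left = [1.0, 1.5, 0.5]    # samples with feature <= 2.0
right = [5.0, 4.5, 5.5]   # samples with feature > 2.0

print(stump_predict(1.5, 2.0, sum(left) / len(left), sum(right) / len(right)))  # → 1.0
```

A full tree simply applies this routing repeatedly, one cutoff per internal node, until a leaf is reached.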
Classification Learning Algorithm
We can see that tree generation is a recursive process. In the basic algorithm, there are three cases in which the recursion returns:
- If all samples belong to the same class, there is no need to split
- If the feature set is empty, OR all samples take the same values on all remaining features, we can’t split further
- If the current node contains no samples, we can’t split
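The three return cases above can be sketched in Python roughly as follows. This is a simplified sketch, not production code: `choose` is a placeholder for the best-feature selection discussed next, defaulting here to just picking the first remaining feature.

```python
from collections import Counter

def build_tree(samples, labels, features, choose=None):
    """Recursive tree generation mirroring the three return cases above.
    `samples` is a list of dicts mapping feature name -> value."""
    # placeholder chooser; a real one would use information gain or Gini index
    choose = choose or (lambda smp, lab, feats: feats[0])

    # Case 3: no samples left -> caller would fall back to the parent's majority class
    if not samples:
        return None
    # Case 1: all samples share one class -> pure leaf, no need to split
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no features left, or all samples identical on every remaining feature
    # -> majority-class leaf
    if not features or all(len({s[f] for s in samples}) == 1 for f in features):
        return Counter(labels).most_common(1)[0][0]

    # Otherwise: pick the best feature and recurse on each of its value branches
    best = choose(samples, labels, features)
    node = {"feature": best, "branches": {}}
    for v in {s[best] for s in samples}:
        idx = [i for i, s in enumerate(samples) if s[best] == v]
        node["branches"][v] = build_tree(
            [samples[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],
            choose,
        )
    return node
```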
From the algorithm above, it is clear that Step 4, picking the best feature to split a node, is the most important step. Generally, as splitting proceeds, we want each node to contain samples from as few classes as possible, i.e. to have higher “purity”.
Information Entropy is a commonly used metric for node purity: Ent(D) = −∑ₖ pₖ log₂(pₖ), where pₖ is the proportion of the k-th class in the sample D. The lower the entropy, the purer the node.
When splitting a node, we choose the feature a with the highest information gain: Gain(D, a) = Ent(D) − ∑ᵥ (|Dᵛ|/|D|) Ent(Dᵛ), where Dᵛ is the subset of D taking value v on feature a.
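As a rough illustration, entropy and information gain can be computed like this (toy data, stdlib only; the function names are mine, not from any library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v),
    where `values` holds feature a's value for each sample."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

# a perfectly informative binary feature: both branches become pure
print(information_gain([0, 0, 1, 1], ["no", "no", "yes", "yes"]))  # → 1.0
```

An irrelevant feature (e.g. `[0, 1, 0, 1]` against the same labels) would score a gain of 0, since each branch stays as mixed as the parent.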
The ID3 algorithm tends to prefer features with many distinct values. For example, given a feature like an ID number, dataset D would split into |D| branches. Since every branch contains only one sample, its information entropy is 0: very pure, but useless for classification. Therefore, C4.5 uses the gain ratio instead.
Another measure decision trees often use is the Gini Index. Gini(D) = 1 − ∑ₖ pₖ² is the probability that two samples randomly drawn from dataset D belong to different classes.
Therefore, among the feature set A, we choose the feature with the lowest Gini Index to split the node: a* = argmin₍a∈A₎ Gini_index(D, a).
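A minimal sketch of this Gini-based selection (the helper names `gini`, `gini_index`, and `best_feature` are mine for illustration, not from a library):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: chance two random samples differ in class."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    """Gini_index(D, a) = sum_v |D_v|/|D| * Gini(D_v)."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return sum(len(g) / n * gini(g) for g in groups.values())

def best_feature(feature_columns, labels):
    """a* = argmin_a Gini_index(D, a) over a dict of feature name -> values."""
    return min(feature_columns, key=lambda a: gini_index(feature_columns[a], labels))

cols = {"good": [0, 0, 1, 1], "bad": [0, 1, 0, 1]}
print(best_feature(cols, [0, 0, 1, 1]))  # → good
```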
Regression Learning Algorithm
For continuous features, it is not efficient to treat each distinct value as a branch, so we need to discretize the continuous variable. The most common way is binary partitioning on a continuous feature a: find a split point s dividing dataset D into the subsets where a ≤ s and a > s.
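A sketch of this binary-partition search for a regression split, choosing the point s that minimizes the summed squared error of the two candidate leaves (a deliberately naive O(n²) toy version; real implementations sort once and scan with running sums):

```python
def best_split(x, y):
    """Scan candidate split points s on continuous feature x, returning the s
    that minimizes total squared error around the two leaf means."""
    def sse(ys):  # sum of squared errors around the mean
        if not ys:
            return 0.0
        m = sum(ys) / len(ys)
        return sum((v - m) ** 2 for v in ys)

    pairs = sorted(zip(x, y))
    best_s, best_err = None, float("inf")
    # candidate cutoffs: midpoints between consecutive distinct feature values
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        s = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [t for v, t in pairs if v <= s]
        right = [t for v, t in pairs if v > s]
        err = sse(left) + sse(right)
        if err < best_err:
            best_s, best_err = s, err
    return best_s

# two clear clusters in the target: the best cutoff falls between x=3 and x=10
print(best_split([1, 2, 3, 10, 11, 12], [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]))  # → 6.5
```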
Put it into words
- What is Information Gain? What are its disadvantages?
- How does a decision tree handle missing attribute values?
- Do we require Feature Scaling for Decision Trees?
- If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
- Advantages and disadvantages of decision trees
- Other advantages of decision trees: no need to scale features, insensitivity to outliers, and automatic handling of missing values
The above are the learning algorithms for classification and regression trees. They are the cornerstones of some very popular modern machine learning methods that use Bagging and Boosting. We will introduce some algorithms built on decision trees in the next post.
As always, thank you for reading the blog! Hope this is helpful. Please let me know if you need more information.