ML Series 4: Decision Tree 🌲

Classification and Regression Trees

Tree-based algorithms form a widely used category of supervised learning. In data competitions like Kaggle, tree-based methods such as LightGBM and Random Forest are usually the first go-to option due to their strong performance and human-understandable explanations. Therefore, I think it’s important to have a separate series to discuss this family of algorithms.🔥


Tree-based models split the data multiple times according to certain cutoff values in the features. Through splitting, different subsets of the dataset are created, with each instance belonging to one subset. The final subsets are called terminal or leaf nodes, and the intermediate subsets are called internal or split nodes. To predict the outcome in a leaf node, the average outcome of the training data in that node is used. Decision trees can capture nonlinear relationships between features and the target.

Classification Learning Algorithm

We can see that tree generation is a recursive process. In the basic algorithm, there are three cases in which the recursion returns:

  1. All samples in the node belong to the same class, so there is no need to split
  2. The feature set is empty, or all samples take the same value on every remaining feature, so we can’t split
  3. The current node contains no samples, so we can’t split
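The three stopping conditions above can be sketched as the base cases of a recursive builder (a minimal sketch, not a full implementation: `samples` are assumed to be dicts mapping feature name to value, and `choose` stands in for any split-selection rule of the kind discussed below):

```python
from collections import Counter

def build_tree(samples, labels, features, choose):
    """Recursive tree generation. `choose(samples, labels, features)`
    is any split-selection rule, e.g. information gain."""
    # Return case 1: all samples belong to the same class -> pure leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Return case 2: no features left, or all samples agree on every
    # remaining feature -> leaf labelled with the majority class.
    if not features or all(len({s[f] for s in samples}) == 1 for f in features):
        return Counter(labels).most_common(1)[0][0]
    best = choose(samples, labels, features)
    tree = {best: {}}
    for v in {s[best] for s in samples}:
        branch = [(s, l) for s, l in zip(samples, labels) if s[best] == v]
        sub_s, sub_l = zip(*branch)
        # Return case 3 (an empty branch) does not arise here because we
        # only branch on observed values; implementations that branch on
        # all possible values label an empty branch with the parent's
        # majority class instead.
        tree[best][v] = build_tree(list(sub_s), list(sub_l),
                                   features - {best}, choose)
    return tree
```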

Split Selection

From the algorithm above, it is clear that Step 4, how to pick the best feature to split a node, is the most important step. Generally, as splitting proceeds, we want each node to contain samples from the same class as much as possible, i.e. to have higher “purity”.

🌲ID3 Algorithm

Information entropy is a commonly used metric for node purity.

The higher the entropy, the messier the data. When there is only one class, Ent(D) = 0.

Here p_k is the proportion of the k-th class in dataset D.
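In standard notation, for a dataset D with |𝒴| classes:

```latex
\mathrm{Ent}(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k
```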

Information Gain

The higher the information gain, the better the split.

When splitting a node, we choose the feature with the highest information gain.
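For a discrete feature a with V possible values that partitions D into subsets D^1, …, D^V, the information gain is:

```latex
\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)
```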

🌲C4.5 Algorithm

The ID3 algorithm tends to prefer features with many distinct values. For example, given a feature like an ID number, dataset D would split into |D| branches. Since every branch contains only one sample, its information entropy is 0, very pure, yet useless for classification. Therefore, C4.5 uses the gain ratio instead.

The more values feature a has, the bigger IV(a) becomes, which counteracts ID3’s bias toward many-valued features.
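The gain ratio normalizes information gain by the feature’s intrinsic value IV(a):

```latex
\mathrm{Gain\_ratio}(D, a) = \frac{\mathrm{Gain}(D, a)}{\mathrm{IV}(a)},
\qquad
\mathrm{IV}(a) = -\sum_{v=1}^{V} \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}
```

In practice C4.5 does not simply maximize the gain ratio: it first keeps the features whose information gain is above average, then picks the one with the highest gain ratio among them.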

🌲CART Algorithm

Another measure decision trees often use is the Gini index. Gini is the probability that two samples randomly drawn from dataset D belong to different classes.

The smaller the Gini, the higher the purity of the dataset.

Therefore, among the feature set A, we choose the feature with the lowest Gini index to split the node: a* = argmin_{a∈A} Gini_index(D, a).
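Both quantities fit in a few lines of Python (a minimal sketch: `gini` is the impurity of a list of labels, and `gini_index` is the weighted impurity after splitting on a categorical feature; the function names are mine, not from any library):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: probability that two samples drawn at random
    from `labels` belong to different classes."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(values, labels):
    """Weighted Gini of the partition induced by a categorical feature:
    each branch's impurity weighted by its share of the samples."""
    n = len(labels)
    total = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        total += len(subset) / n * gini(subset)
    return total
```

A perfectly separating feature gives a Gini index of 0, so it would be picked by the argmin rule above.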

Regression Learning Algorithm

For continuous features, it is not efficient to treat each value as a branch, so we need to discretize the continuous variable. The most common way is binary partitioning on a continuous feature a: find a point s that splits dataset D into the subsets ≤ s and > s.
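The search for s can be sketched as a brute-force scan over candidate cut points (a minimal sketch, assuming the regression criterion is the total squared error of the two halves, which is equivalent to predicting each half’s mean):

```python
def best_split(x, y):
    """Find the cut point s on continuous feature x that minimises the
    total squared error of the two resulting halves of y."""
    def sse(vals):
        # Sum of squared errors around the mean of `vals`.
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    # Candidate cut points: midpoints between consecutive distinct values.
    distinct = sorted(set(x))
    best_s, best_err = None, float("inf")
    for lo, hi in zip(distinct, distinct[1:]):
        s = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= s]
        right = [yi for xi, yi in zip(x, y) if xi > s]
        err = sse(left) + sse(right)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err
```

Regression trees apply this search recursively to each half, and a leaf predicts the mean target of its training samples.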


Interview Questions

  • What is Information Gain? What are its disadvantages?
  • How does a decision tree handle missing attribute values?
  • Do we require Feature Scaling for Decision Trees?
  • If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
  • What are the advantages and disadvantages of decision trees?
  • Other advantages of decision trees: no need to scale features, insensitivity to outliers, and automatic handling of missing values

Above are the learning algorithms for classification and regression trees. They are the cornerstones of some very popular modern machine learning methods based on bagging and boosting. We will introduce algorithms built on decision trees in the next post.

As always, thank you for reading the blog! Hope this is helpful. Please let me know if you need more information.



Data Scientist in Finance. Take care of the memories, polish knowledge.

Junlin Liu
