What's a Decision Tree?
This post is based on the decision-tree chapter of the great book Artificial Intelligence: A Modern Approach by Russell and Norvig.
So what’s a decision tree? Simply put, it’s a way to make predictions based on observations from a knowledge base. Let’s say we observed the following data set for deciding whether to wait for a table at a restaurant. It contains twelve observations in total, each described by 10 attributes.
The goal now is to see whether there is a pattern in this data that lets us predict whether a new observation, described by the same attributes, will yield a positive or negative answer to the “will wait?” question.
But how can we model this decision? Decision trees to the rescue! In this model we try to build a tree that leads us through intermediate decisions (stored in internal nodes) to a definite decision (stored in its leaves) in the minimum number of steps. Trivially, one might just create a branch for each example, but this is very inefficient and has terrible predictive performance on new, unobserved examples.
This whole idea is actually more intuitive than it sounds. Let’s look at the algorithm as defined in the book:
- If there are some positive and some negative examples, then choose the best attribute to split them.
- If all the remaining examples are positive (or all negative), then we are done: we can answer Yes or No.
- If there are no examples left, it means that no such example has been observed, and we return a default value calculated from the majority classification at the node’s parent.
- If there are no attributes left, but both positive and negative examples, we have a problem. It means that these examples have exactly the same description, but different classifications. This happens when some of the data are incorrect; we say there is noise in the data. It also happens either when the attributes do not give enough information to describe the situation fully, or when the domain is truly nondeterministic. One simple way out of the problem is to use a majority vote.
Alright, so basically it’s a recursion that, at each step, tries to find the attribute that best splits the remaining examples.
First up is the main decision_tree_learning function. It’s basically a 1:1 translation of the informal description above.
majority_value returns the WillWait value that occurred most often in this subset of examples.
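The post’s original listings aren’t reproduced here, but the recursion can be sketched in Python. The dict-based example format and the stubbed choose_attribute (the real heuristic is defined later in the post) are my assumptions:

```python
from collections import Counter


def majority_value(examples):
    """Return the WillWait value that occurs most often in this subset."""
    return Counter(e["WillWait"] for e in examples).most_common(1)[0][0]


def choose_attribute(attributes, examples):
    """Stub for self-containment: just pick the first attribute.
    The post replaces this with an information-gain heuristic."""
    return attributes[0]


def decision_tree_learning(examples, attributes, default=None):
    """1:1 translation of the informal description above.

    Each example is a dict mapping attribute -> value, plus a 'WillWait' label.
    Returns either a label (leaf) or an (attribute, {value: subtree}) pair.
    """
    if not examples:                       # no examples left: parent's majority
        return default
    labels = {e["WillWait"] for e in examples}
    if len(labels) == 1:                   # all positive or all negative: done
        return labels.pop()
    if not attributes:                     # noise or missing info: majority vote
        return majority_value(examples)
    best = choose_attribute(attributes, examples)
    remaining = [a for a in attributes if a != best]
    tree = (best, {})
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        tree[1][value] = decision_tree_learning(
            subset, remaining, majority_value(examples))
    return tree
```

Note how the four cases from the book map directly onto the four branches of the function, in the same order they are checked.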
Next up is the choose_attribute function, which selects the attribute that yields the best split of the example subset. That’s pretty much it for the decision tree itself. The choose_attribute function is generic and can use different heuristics. Here we’ll use a heuristic based on which attribute provides the highest “information gain”.
distinct is a helper function returning an array with the distinct values of the input array, in this case the different values for an attribute.
As stated, the heuristic used in this example is “information gain”; the mathematical details are beyond the scope of this quick post. You can find the details in the book or here. In general, the information gain from an attribute test is the difference between the original information requirement (in bits) and the new requirement after the split.
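As a sketch, the gain computation could look like this in Python. The entropy helper (H = −Σ p·log₂ p over the label proportions) and the dict-based example format are my assumptions:

```python
import math
from collections import Counter


def distinct(values):
    """Distinct values of the input, preserving first-seen order."""
    return list(dict.fromkeys(values))


def entropy(examples):
    """Information requirement (in bits) of the WillWait labels:
    H = -sum(p * log2(p)) over the proportion p of each label."""
    counts = Counter(e["WillWait"] for e in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(attribute, examples):
    """Original information requirement minus the weighted requirement
    remaining after splitting on the attribute."""
    remainder = 0.0
    for value in distinct(e[attribute] for e in examples):
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder


def choose_attribute(attributes, examples):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(a, examples))
```

An attribute that splits the examples into purely positive and purely negative subsets gets the maximum gain (the full entropy of the set), while an attribute whose subsets mirror the original label mix gains nothing.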
split_pn is just a simple function that splits the examples into positive ("true") and negative ("false") ones.
And that’s it. The resulting tree from running this code can be seen at the bottom.
Note: the tree we arrive at here is slightly different from Russell & Norvig’s; it’s actually slightly smaller! If that’s due to a bug in my implementation, please let me know ;)

You have a question or found an issue? Then head over to GitHub and open an issue, please!