Introduction To Tree Based ML Methods

Gradient boosting, decision trees, and random forests are some of the tree-based methods in this lesson.

Jul 23, 2022

∙ Paid

Numeric and categorical outputs are predicted with tree-based learning algorithms, also known as CART.

Boosting, bagging, random forests, decision trees, and other tree-based methods have been proven to be very effective for supervised learning. Moreover, they can predict both discrete and continuous outcomes, which explains their high accuracy.

Three tree-based algorithms will be discussed:

Decision Trees
Random Forests
Gradient Boosting

Decision Trees

A Decision Tree (DT) creates a decision structure by dividing data into groups and interpreting patterns through those groups. Entropy (a measure of variance in data among different classes) is used to divide the data into homogeneous or numerically relevant groups. Decision trees are often graphical representations of tree-like graphs that can be easily understood by non-experts.

Random Forest

RF (Random Forests) is a technique that mitigates overfitting by growing multiple trees. For regression or classification, the results are combined by averaging the output of multiple decision trees, using a randomized selection of input data for each tree.

Gradient Boosting

A boosting technique, like random forests, aggregates the outcomes of multiple decision trees through regression and classification.

Gradient Boosting (GB) is a sequential method that improves the performance of each subsequent decision tree rather than creating random independent variants of a decision tree in parallel.

Introduction To Decision Trees

Based on entropy (the measure of variance within a class), DT creates a decision structure for interpreting patterns by classifying data into homogeneous or numerically relevant groups. Tree-like graphs are useful for displaying decision trees, and they’re easy for non-experts to understand.

It is quite different from an actual tree in that the “leaves” are located at the bottom, or foot, of the tree. Decision rules are expressed by the path from the tree’s root to its terminal leaf node. A branch represents the outcome of a decision or variable, and a leaf node represents a class label.

In order to implement DT, the following steps must be taken:

1 — Import libraries
2 — Import the dataset
3 — Convert non-numeric variables
4 — Remove columns
5 — Set X and y variables
6 — Set algorithm
7 — Evaluate

Exercise

Using the Advertising dataset, let’s use a decision tree classifier to predict a user’s click-through outcome.

Decision Trees Implementation

1: Import Libraries

As we will be predicting a discrete variable, we will use the classification version of the decision tree algorithm. In this example, we attempt to predict the dependent variable Clicked on Ad (0 or 1) by using the DecisionTreeClassifier algorithm from Scikit-learn. A classification report and a confusion matrix will be used to evaluate the model’s performance.

2: Import Dataset

Assign a variable name to the Advertising Dataset and import it as a data frame.

3: Convert Non-Numerical Variables

Numerical values can be obtained by encoding the Country and City variables one-at-a-time.

4: Remove Columns

Ad Topic Line and Timestamp are not relevant or practical for this model, so delete them.

5: Set X and Y Variables

Our dependent variable (y) is the number of times people clicked on the ad, whereas our independent variables (X) are the remaining variables. In addition to Daily Time Spent on Site, Age, Area Income, Daily Internet Usage, Male, Country, and City, there are also independent variables.

6: Set Algorithm

Assign DecisionTreeClassifier() to the variable model.

7: Evaluate

The training model should be tested on the X test data by using the predict function and a new variable name should be assigned.

Nine false negatives and ten false positives were produced by the model. Next, let’s see if we can improve predictive accuracy using multiple decision trees.

Random Forest

In addition to explaining a model’s decision structure, decision trees can also overfit the model.

A decision tree decodes patterns accurately when using training data, but since it uses a fixed sequence of paths, it can make poor predictions when using test data. In addition to having only one tree design, this method is also limited in its ability to manage variance and future outliers.

Multiple trees can be grown using a different technique called RF to mitigate overfitting. For regression or classification, multiple decision trees are grown with a randomized selection of input data for each tree and the results are combined by averaging the outputs.

A randomized and capped variable is also chosen to divide the data. Every tree in the forest would look the same if a full set of variables were examined. Each split would allow the trees to select the optimal variable according to the maximum information gain at the subsequent layer.

However, random forests algorithm does not have a full set of variables to draw from, like a standard decision tree. Because random forests use randomized data and fewer variables, they are less likely to produce similar trees. Because random forests incorporate volatility and volume, they are potentially less likely to be overfitted and provide reliable results.

In order to implement RF, the following steps must be taken:

1 — Import libraries
2 — Import dataset
3 — Convert non-numeric variables
4 — Remove columns
5 — Set X and y variables
6 — Set algorithm
7 — Evaluate

Exercise

In this exercise, we are going to repeat the previous one, but using RandomForestClassifier from Scikit-Learn to learn the dependent and independent variables.

Implementation Of Random Forest

This part will familiarize you with the implementation steps of random forest.

1: Import Libraries

In Scikit-learn, random forests can be built using either classification or regression algorithms. Scikit-learn’s RandomForestClassifier will be used here instead of RandomForestRegressor, which is used for regression.

2: Import Dataset

Import the Advertising dataset.

3: Convert non-numeric variables

Convert Country and City to numeric values using one-hot encoding.

4: Remove Variables

Data frame should be modified by removing the following two variables.

5: Set X and Y Variable

X and Y variables should be assigned the same values, and the data should be split 70/30.

6: Set Algorithm

Give RandomForestClassifier a name and specify how many estimators it should have. In general, this algorithm works best with 100–150 estimators (trees).

7: Evaluate

The predict method can be used to predict the x test values and assign them to a new variable.

A comparatively low number of false-positives (5) and false-negatives (7) were found in the test data, and the F1-score was 0.96 instead of 0.94 in the previous exercise using a decision tree classifier.

Gradient Boosting

For aggregating the results of multiple decision trees, boosting is another regression/classification technique.

It is a sequential method that aims to improve the performance of each subsequent tree, rather than creating independent variants of a decision tree in parallel. Weak models are evaluated and then weighted in subsequent rounds to mitigate misclassifications resulting from earlier rounds. A higher proportion of items are classified incorrectly at the next round than at the previous round.

This results in a weaker model, but its modifications enable it to capitalize on the mistakes of the old model. Machine learning algorithms such as gradient boosting are popular due to their ability to learn from their mistakes.

GB implementation involves the following steps:

1 — Import libraries
2 — Import dataset
3 — Convert non-numeric variables
4 — Remove variables
5 — Set X and y variables
6 — Set algorithm
7 — Evaluate

Implementation Of Gradient Booster Classifier

In this third exercise, gradient boosting will be used to predict the outcome of the Advertising dataset so that the results can be compared with those from the previous two exercises.

The regression variant of gradient boosting will be familiar to readers of Machine Learning for Absolute Beginners Second Edition. Rather than using the classification version of the algorithm in this exercise, we will use a classification version which predicts a discrete variable using slightly different hyperparameters.

1: Import Libraries

Using Scikit-learn’s ensemble package, this model uses Gradient Boosting for classification.

2: Import Dataset

The Advertising dataset should be imported and assigned as a variable.

3: Convert Non Numerical Variables

4: Remove Variables

Remove the following two variables from the data frame.

5: Set X and Y Variables

X and Y variables should be assigned the same values, and data should be split 70/30.

6: Set Algorithm

Gradient Boosting Classifier should be given a variable name and its number of estimators should be specified. With a learning rate of 0.1 and a deviance loss argument, 150–250 estimators (trees) are a good starting point for this algorithm.

7: Evaluate

Assign the predicted X test values to a new variable using the predict method.

There was one less prediction error in this model than in random forests. In spite of this, the f1-score remains 0.96, the same as random forests, but better than the classifier which was developed earlier (0.94).

Implementation of Gradient Boosting Regressor

This exercise uses gradient boosting to estimate the nightly Airbnb rate for Berlin, Germany using a regression model. Then we’ll test a sample listing based on our initial model.

1: Import Libraries

Import the following libraries:

2: Import Dataset

Using the Berlin Airbnb dataset from Kaggle, we will perform this regression exercise.

The Berlin Airbnb dataset can be loaded into a Pandas dataframe by using the pd.read_csv command.

3: Remove Variable

It is not necessary to use all X variables for our model because some information overlaps between them (for example, host_name overlaps with host_id and neighborhood overlaps with neighborhood_group). Also, some variables are irrelevant, such as last_review, while others are discrete and difficult to parse, such as longitude and latitude.

Introduction To Tree Based ML Methods

Gradient boosting, decision trees, and random forests are some of the tree-based methods in this lesson.

Decision Trees

Random Forest

Gradient Boosting

Introduction To Decision Trees

Exercise

Decision Trees Implementation

1: Import Libraries

2: Import Dataset

3: Convert Non-Numerical Variables

4: Remove Columns

5: Set X and Y Variables

6: Set Algorithm

7: Evaluate

Random Forest

Exercise

Implementation Of Random Forest

1: Import Libraries

2: Import Dataset

3: Convert non-numeric variables

4: Remove Variables

5: Set X and Y Variable

6: Set Algorithm

7: Evaluate

Gradient Boosting

Implementation Of Gradient Booster Classifier

1: Import Libraries

2: Import Dataset

3: Convert Non Numerical Variables

4: Remove Variables

5: Set X and Y Variables

6: Set Algorithm

7: Evaluate

Implementation of Gradient Boosting Regressor

1: Import Libraries

2: Import Dataset

3: Remove Variable

This post is for paid subscribers