This lab will introduce you to modeling using Classification and Regression Trees (CART). These trees can model continuous or categorical data but are particularly well suited for categorical response data. In R there are two CART libraries, tree and rpart. rpart is the newer of the two and is the one used by Plant, so we'll use it.
Note that for this lab you'll need the rpart and sp (spatial) libraries, and you'll need na.omit() to remove invalid values; otherwise your prediction will have a different number of rows than your original data.
The code below contains a series of functions that we'll use to create synthetic data sets in R. Copy this code into RStudio and run it to define the functions.
library(rpart)
library(sp)

######################################################################################
# Creates a two-dimensional pattern of data increasing from the lower left
# to the upper right
######################################################################################
Linear2D=function(NumEntries=100)
{
    Ys=as.vector(array(1:NumEntries))
    Xs=as.vector(array(1:NumEntries))
    Measures=as.vector(array(1:NumEntries))
    for (Index in 1:NumEntries)
    {
        Xs[Index]=runif(1,0,100)
        Ys[Index]=runif(1,0,100)
        Measures[Index]=Xs[Index]+Ys[Index]
    }
    TheDataFrame = data.frame(Ys, Xs, Measures)
}

######################################################################################
# Creates a two-dimensional data set with a single split to the data in the X
# direction.
######################################################################################
Categories1D=function(NumEntries=100)
{
    Ys=as.vector(array(1:NumEntries))
    Xs=as.vector(array(1:NumEntries))
    Measures=as.vector(array(1:NumEntries))
    for (Index in 1:NumEntries)
    {
        Xs[Index]=runif(1,0,100)
        Ys[Index]=runif(1,0,100)
        if (Xs[Index]>50) Measures[Index]=1
        else Measures[Index]=2
    }
    TheDataFrame = data.frame(Ys, Xs, Measures)
}

######################################################################################
# Creates a two-dimensional data set with two splits, one in the X and one in the Y
# direction.
######################################################################################
Categories2D=function(NumEntries=100)
{
    Ys=as.vector(array(1:NumEntries))
    Xs=as.vector(array(1:NumEntries))
    Measures=as.vector(array(1:NumEntries))
    for (Index in 1:NumEntries)
    {
        Xs[Index]=runif(1,0,100)
        Ys[Index]=runif(1,0,100)
        if (Xs[Index]>50)
        {
            if (Ys[Index]>50) Measures[Index]=2
            else Measures[Index]=1
        }
        else
        {
            if (Ys[Index]<50) Measures[Index]=2
            else Measures[Index]=1
        }
    }
    TheDataFrame = data.frame(Ys, Xs, Measures)
}
The code below will run one of the functions above and then create a bubble chart to show the distribution of the data. Remember that this is in a two-dimensional predictor space.
Note that we need to create a copy of the data before converting it to a "SpatialPointsDataFrame". This lets us keep using the variable "TheDataFrame" for analysis in the next step.
TheDataFrame=Categories1D(100) # creates two areas with different values
SpatialDataFrame=TheDataFrame # make a copy
coordinates(SpatialDataFrame)= ~ Xs+Ys # converts to a SpatialPointsDataFrame
bubble(SpatialDataFrame, zcol='Measures', fill=TRUE, do.sqrt=FALSE, maxsize=3,
       scales = list(draw = TRUE), xlab = "Xs", ylab = "Ys")
The code below will create a tree and output the results. Run this for data created with the first of the synthetic data functions (Linear2D()). Then, take a look at the results and see what you think.
TheDataFrame=na.omit(TheDataFrame) # remove any blanks
TheTree=rpart(Measures~Xs+Ys,data=TheDataFrame,method="anova") # create the tree
par(mar=c(2,2,1,1)) # plot margins: bottom, left, top, right
plot(TheTree) # plot the tree (without labels)
text(TheTree) # add the labels to the branches
print(TheTree) # print the equations for the tree
summary(TheTree)
Printing the model will produce the results below. Note that in the rpart() call above we used method="anova". Because of this, the deviance is simply the sum of the squared differences between our model's predicted values and the observed data values.
n= 100

node), split, n, deviance, yval
      * denotes terminal node

 1) root 100 162658.6000  91.22948
   2) Ys< 39.97511 48 43603.5900  63.60367
     4) Xs< 37.85148 22 4778.8340  37.52148
       8) Xs< 27.23887 15 1611.7370  30.71262 *
       9) Xs>=27.23887 7 981.5252  52.11190 *
     5) Xs>=37.85148 26 11194.8900  85.67321
      10) Xs< 64.17446 15 2253.3050  72.29593 *
      11) Xs>=64.17446 11 2596.9370 103.91500 *
   3) Ys>=39.97511 52 48607.2000 116.73020
     6) Xs< 57.2601 31 13012.0500  97.52345
      12) Xs< 26.92681 15 4944.0230  82.21864 *
      13) Xs>=26.92681 16 1260.5070 111.87170 *
     7) Xs>=57.2601 21 7277.6820 145.08310
      14) Ys< 76.65642 14 2322.3320 135.85410 *
      15) Ys>=76.65642 7 1378.0720 163.54090 *
Note that this is a "code" version of the tree. It starts with the "root" of the tree at node 1. Node 2 shows a split (to the left) when the Y value is less than 39.97511 (Ys< 39.97511). The other side of this split is node 3, for Ys>=39.97511. The tree continues until we reach terminal nodes, which are marked with an asterisk. Each line of this version of the tree also shows the number of data points at the node, the deviance at that node, and the response value (yval) for the node.
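To see where the deviance numbers come from, you can reproduce them by hand. This is a quick check, assuming TheTree and TheDataFrame from the steps above:

sum((TheDataFrame$Measures-mean(TheDataFrame$Measures))^2) # deviance at the root: squared differences from the overall mean
sum((TheDataFrame$Measures-predict(TheTree))^2) # total deviance of the terminal nodes: squared differences from each leaf's yval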
Evaluating the performance of trees with rpart is a little different. Use the printcp(TheTree) function to print summary statistics, as below. (The example output shown here is from the classification tree we'll fit later in this lab.)
> printcp(TheTree)

Classification tree:
rpart(formula = Genus ~ Precip + MinTemp + MaxTemp, data = TheData,
    method = "class", control = TheControls)

Variables actually used in tree construction:
[1] MaxTemp MinTemp Precip

Root node error: 7715/29763 = 0.25921

n= 29763

         CP nsplit rel error  xerror      xstd
1 0.0014128      0    1.0000 1.00000 0.0097989
2 0.0010000     17    0.9711 0.99935 0.0097969
First the original formula and parameters are printed, followed by the covariates that were actually used in the model. The "Root node error" is the number and proportion of misclassified response values at the root node, before any splits have been made. "n" is the total number of response values. In the table at the end, "rel error" is the error rate on the training data while "xerror" is the cross-validated error rate; both are expressed relative to the root node error. The lower these values, the better the model. See Understanding the Outputs of the Decision Tree Tool for more information.
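The same cp table is stored on the tree object itself, which lets you prune the tree back to the complexity with the lowest cross-validated error. This is an optional sketch, assuming TheTree from above:

plotcp(TheTree) # plot xerror against cp to see where the cross-validated error levels off
BestCP=TheTree$cptable[which.min(TheTree$cptable[,"xerror"]),"CP"] # cp value with the lowest xerror
ThePrunedTree=prune(TheTree,cp=BestCP) # prune the tree back to that complexity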
Now, we can take a look at the residuals.
TheResiduals=residuals(TheTree) # gets the residuals for our training data
plot(TheResiduals)
summary(TheResiduals)
Now, let's create a prediction from our tree to see the results.
ThePrediction=predict(TheTree) # predicts the original data based on the model
par(mar=c(4,4,4,4))
plot(ThePrediction, TheResiduals)
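If you'd like a single summary number for the fit, the root mean squared error can be computed from the residuals (a quick sketch using the objects created above):

sqrt(mean(TheResiduals^2)) # root mean squared error of the training predictions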
We can view the predicted output with another bubble chart.
Xs=TheDataFrame$Xs # get the x values
Ys=TheDataFrame$Ys # get the y values
TheDataFrame2 = data.frame(Ys, Xs, ThePrediction) # create a data frame with the predictions
SpatialDataFrame2=TheDataFrame2 # make a copy
coordinates(SpatialDataFrame2)= ~ Xs+Ys # converts to a SpatialPointsDataFrame
bubble(SpatialDataFrame2, zcol='ThePrediction', fill=TRUE, do.sqrt=FALSE, maxsize=3,
       scales = list(draw = TRUE), xlab = "Xs", ylab = "Ys")
Try the procedure above for data created from the Categories2D() and Linear2D() functions:

TheDataFrame=Categories2D(100) # creates four areas with different values
TheDataFrame=Linear2D(100) # creates a 2-dimensional linear surface
Now, let's try using "rpart.control()" to create new parameters for some of our data. The rpart.control() function creates an object we can pass into rpart() to change the complexity of the trees that are fitted to the data.
To find out about the complexity parameter, it is time for us to start reading the standard R documentation. Search on the web for "R rpart" and then find the standard PDF for this library. Below is an example.
TheControl=rpart.control(minsplit=5,cp=0.2) #changes the allowed complexity of the tree
TheTree=rpart(Measures~Xs+Ys,data=TheDataFrame,method="anova",control=TheControl)
par(mar=c(1,1,1,1)) # margins: bottom, left, top, right
plot(TheTree)
text(TheTree)
CART is probably most useful for situations where your response variable is categorical, such as presence/absence, dominant species, or fire severity. Your response data will need to be integers or a "factor" in R for rpart to work correctly. You may also want to explore the older "tree" library. This Dominant Tree data set contains counts for the dominant tree species in the FIA plots in California. The genus in this data set can be used as a categorical response variable. Note that we're also using the "class" method instead of "anova".
library(tree)
library(rpart)
TheData <- read.csv("E:\\jimg\\GSP 570\\CART\\DominnantTrees_100000_Covariates.csv")
TheData=na.omit(TheData) # remove any NA (null) values
TheControls=rpart.control(minsplit=1,minbucket=20, cp=0.001) # to get this data set to work, we have to set the cp very low
TheTree=rpart(Genus~Precip+MinTemp+MaxTemp,data=TheData,method="class",control=TheControls) # create the tree
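Once the tree is fit, a quick way to gauge how well it sorts the training data is a confusion matrix. This is a minimal sketch using the objects above (note that if read.csv left Genus as plain character strings, you may need TheData$Genus=as.factor(TheData$Genus) before fitting):

PredictedGenus=predict(TheTree,type="class") # predicted class for each row of the training data
table(Observed=TheData$Genus, Predicted=PredictedGenus) # confusion matrix of observed vs. predicted genera
mean(as.character(PredictedGenus)==as.character(TheData$Genus)) # overall training accuracy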
The default plots from rpart() are not very attractive. The rpart.plot library produces much nicer plots; the code below shows how to use it.
library(rpart.plot) # load the library
rpart.plot(TheTree) # plot the tree
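rpart.plot() also accepts arguments such as type and extra to control what is drawn at each node. The call below is one possibility (a sketch; check the rpart.plot documentation for the full set of options):

rpart.plot(TheTree, type=2, extra=104) # split labels below the node boxes; per-class probabilities plus the percentage of observations in each node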
The process for creating a predicted map for categorical variables from a shapefile is very similar to the process for point files with continuous variables above, until you reach the prediction step. For categorical variables, you'll need type="vector" to obtain the numeric category codes and type="class" to obtain the names of the categories.
ThePrediction = predict(TheTree,NewData,type="vector")
ThePredictedNames = predict(TheTree,NewData,type="class")
Then, you can add these vectors back into the data frame and write out the file:
FinalData=data.frame(NewData,ThePrediction)
FinalData=data.frame(FinalData,ThePredictedNames)
write.csv(FinalData, file = "C:/Temp/TheDataWithPrediction.csv")
Refer to the R for Spatial Statistics web site, the section on CART under Section 7, "Correlation/Regression Modeling", and Section 8.3 of Plant's book.
See also: Plotting rpart trees with the rpart.plot package.
Note: You are welcome to use content from previous labs to complete this lab.
© Copyright 2018 HSU - All rights reserved.