
I want to run a tree regression. The data is in this format:

           L2         L3        L4        L5        L6 ele         ndvi    nd_var nd_ps ldclas
1 0.010814554 0.11304182 0.1360298 0.2098749 0.2437155 179  0.012483470  286688.2  7361   agri
2 0.010853562 0.10954640 0.1279681 0.1986370 0.2224236 183 -0.005020924  383210.9  7353   agri
3 0.011879258 0.12245614 0.1507865 0.2681184 0.2980641 184  0.005531083 1210329.6  7539   agri
4 0.009947186 0.09288491 0.1018834 0.2433811 0.2778357 193 -0.043884473  372672.2  7189   agri
5 0.010979766 0.10698310 0.1283619 0.2131286 0.2349639 193 -0.022636201  472360.7  7392   agri
6 0.011418039 0.11616439 0.1401070 0.2539036 0.3128864 195 -0.001042468  629364.2  7263   agri

ldclas is the dependent variable. It has 10 levels: agri, tea, teak, rubber, etc.

The output of dput(tt) is:

structure(list(L2 = c(0.00912571167754499, 0.00930928144178689, 
0.00934829001668829, 0.0088274108106519, 0.00936205774900643, 
0.00895361502356821, 0.00898573973231054, 0.00755389557122373, 
0.0075997880122842, 0.00758602027996606, 0.00788891039096519, 
0.00775582231188981, 0.00781777710732146, 0.00793250820997264, 
0.00815738117116897, 0.00817114890348711), L3 = c(0.0878981140668165, 
0.0923722488117655, 0.0880612335627261, 0.0763632354274946, 0.0775283746839917, 
0.082748198553099, 0.0864766441738899, 0.0545518285458678, 0.0588628437949073, 
0.0566956847778226, 0.0579540351748395, 0.0588628437949073, 0.0606105526796531, 
0.0575345850425006, 0.0649681734989524, 0.0623116559941389), 
    L4 = c(0.0848333226476736, 0.0903004613645694, 0.088516691528972, 
    0.073088240743156, 0.0761924635739359, 0.0779299017254917, 
    0.0815206072387071, 0.036532542034421, 0.0375518390833337, 
    0.0378298291875827, 0.0388722920785162, 0.0384089752381013, 
    0.0395672673391385, 0.0402622425997609, 0.0436212896927688, 
    0.0423240025396071), L5 = c(0.22561265031896, 0.236273695432274, 
    0.208398062322137, 0.17396888632849, 0.135616814946827, 0.208075000349006, 
    0.217836087108599, 0.118148392542544, 0.198013927471506, 
    0.166792295353943, 0.149716162488461, 0.183937655785095, 
    0.18880666123728, 0.129386334036449, 0.223697354335399, 0.193560287413347
    ), L6 = c(0.177203322015849, 0.200068266889341, 0.190253179119034, 
    0.163732501780303, 0.16849603196228, 0.173259562144258, 0.184647722672334, 
    0.0603306628998872, 0.0772578120116587, 0.0753302439845328, 
    0.0664678622506211, 0.0696583196748293, 0.0774350596463369, 
    0.0615492403883001, 0.0991922068030903, 0.0796728110341496
    ), ele = c(666, 773, 766, 678, 787, 809, 857, 738, 748, 855, 
    500, 612, 588, 397, 261, 258), ndvi = c(-0.0283995447391665, 
    -0.0135402419404802, -0.0395083528567925, -0.0819444409706586, 
    -0.103586067539291, -0.0490366118119649, -0.0288226681221347, 
    -0.17071641510358, -0.136545326259316, -0.154017449391041, 
    -0.16240155229558, -0.146503439773889, -0.136064892814646, 
    -0.168614157809797, -0.122837753698589, -0.144167470536185
    ), nd_var = c(131202.666666667, 433640.666666667, 461440.222222222, 
    210334.888888889, 79202, 4817.55555555556, 55640.6666666667, 
    105110.222222222, 263000.888888889, 63993.5555555556, 95738.8888888889, 
    29214, 34386.8888888889, 74852.6666666667, 63421.5555555556, 
    47259.5555555556), nd_ps = c(7836, 7407, 8644, 7460, 8731, 
    7675, 8202, 8457, 8160, 8152, 7705, 8108, 8016, 7898, 7901, 
    7946), ldclas = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 
    5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L), .Label = c("agri", "coconut", 
    "DDF", "grass", "MDF", "rubber", "tea", "teak", "water", 
    "young rubber"), class = "factor")), .Names = c("L2", "L3", 
"L4", "L5", "L6", "ele", "ndvi", "nd_var", "nd_ps", "ldclas"), row.names = 95:110, class = "data.frame")

I have used the following code:

library(party)
ct <- ctree(ldclas ~ L2 + L3 + L4 + L5 + L6 + ele + ndvi + nd_var + nd_ps, data = tt)

I get a result like:

1) ele <= 637; criterion = 1, statistic = 216.044
  2) L3 <= 0.09185959; criterion = 1, statistic = 187.431
    3) L5 <= 0.05141302; criterion = 1, statistic = 165.797
      4)*  weights = 12

But I cannot tell which classes of the dependent variable end up in each node of the tree. For example, which class of the response variable is assigned to ele > 637, and what code shows this on the plot?

Karthik K
  • Try `n <- nodes(ct, 1)[[1]];` `unique(tt[which(as.logical(n$weights)), 10])` – David Arenburg Apr 30 '14 at 10:31
  • Thanks. I am getting the same for n. – Karthik K Apr 30 '14 at 10:40
  • Not sure what you mean. Did it solve your problem? – David Arenburg Apr 30 '14 at 10:41
  • I ran the code that you had suggested. I got the same result that used to get earlier. – Karthik K Apr 30 '14 at 10:53
  • What was the result? What exactly is your desired output? Can you provide a `dput(tt)`, and which package did you use, `party` or `partykit`? Also, what kind of plot do you want? What's wrong with `plot(ct)`? – David Arenburg Apr 30 '14 at 10:55
  • I am getting the same result that I have put up in the main question section. I am using the package "party". As there are 10 levels in my response variable (like "agri", "grass", "tea", "teak" etc), I want to know which of these response classes are segregated at each level in the regression tree. The regression tree ends depicting nodes with numbers like n=22, which I am not able to understand. For example if "grass" (of ldclas) is segregated for ele > 637, how do I know in the plot. Currently it is showing n values which I am not able to understand. – Karthik K Apr 30 '14 at 11:12
  • Please provide `dput(tt)`, because your question isn't reproducible. See [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – David Arenburg Apr 30 '14 at 11:15
  • Thanks for your response! I have added the dput(tt) in the main question section. – Karthik K Apr 30 '14 at 11:31
  • This is the wrong dput. You have only 10 observations there, so we won't be able to reproduce your problem. Let me answer with example data, and you can tell me exactly how to edit my answer – David Arenburg Apr 30 '14 at 11:33
  • See my answer below, and please feel free to comment on it in order to get a better understanding – David Arenburg Apr 30 '14 at 11:53

1 Answer


I'm not sure I fully understand your question, and you haven't provided a reproducible example, so I'll try to wrap it up with a standalone example and address your comments along the way.

So let's fit a classification tree whose response variable has 3 levels:

library(party)
irisct <- ctree(Species ~ ., data = iris)
plot(irisct)

[Plot of irisct: the tree, with a bar chart of class proportions in each terminal node]

The plot shows the distribution (as proportions) of the response variable in each leaf (terminal node). For example, you can see that node 2 is 100% setosa. The n = 50 (you asked about it in the comments) means that there are 50 observations in that node (the total count, not the number of unique values). To see the tree structure, print the object:

irisct
##1) Petal.Length <= 1.9; criterion = 1, statistic = 140.264
##  2)*  weights = 50 
##1) Petal.Length > 1.9
##  3) Petal.Width <= 1.7; criterion = 1, statistic = 67.894
##    4) Petal.Length <= 4.8; criterion = 0.999, statistic = 13.865
##      5)*  weights = 46 
##    4) Petal.Length > 4.8
##      6)*  weights = 8 
##  3) Petal.Width > 1.7
##    7)*  weights = 46 

In 2)* weights = 50, the value weights = 50 means that there are 50 observations in that node. Because we didn't specify the weights parameter in ctree(), each observation gets a weight of 1 by default (you can set the weights parameter differently, see ?ctree). The * marks terminal nodes.
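
As a quick check of this reading of the printed tree, the weights stored in a node can be summed to recover the printed count. This is a minimal sketch that refits the same iris tree; the names node2, n_obs and is_leaf are only illustrative.

```r
library(party)

irisct <- ctree(Species ~ ., data = iris)

# nodes() returns a list of node objects; take the one for node 2
node2 <- nodes(irisct, 2)[[1]]

# With the default case weights, each observation contributes 0 or 1,
# so the sum is the number of observations in the node
n_obs <- sum(node2$weights)
n_obs

# TRUE for a terminal node, matching the * in the printed tree
is_leaf <- node2$terminal
is_leaf
```

If you pass your own weights to ctree(), the printed value is the sum of those weights rather than a plain count.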

Now, to get to your main question: you can get the distribution of each level in any node (no matter whether it is terminal or not) with the following code:

target <- "Species" # your response variable, which will be "ldclas" in your case
Node <- 5 # the node you want to investigate
n <- nodes(irisct, Node)[[1]] # retrieve that node, including its weights
x <- iris[which(as.logical(n$weights)), ] # keep the observations that fall in that node
paste(paste(names(table(x[target])), ": ", round((as.numeric(table(x[target]))/nrow(x))*100, 3), "%", sep = ""), collapse = ", ")
## [1] "setosa: 0%, versicolor: 97.826%, virginica: 2.174%"

The output gives you the percentage of each level of the response variable in that node.
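
If you would rather see every terminal node at once instead of one node at a time, a cross-tabulation of node membership against the response may be enough. This is a sketch, assuming the same iris tree and using where() from party, which returns the terminal node each observation falls into; the names leaf, counts and pct are only illustrative.

```r
library(party)

irisct <- ctree(Species ~ ., data = iris)

# Terminal node ID for every observation in the training data
leaf <- where(irisct)

# Rows are terminal nodes, columns are classes of the response
counts <- table(node = leaf, class = iris$Species)
counts

# The same table as row percentages, so each node sums to 100
pct <- round(100 * prop.table(counts, margin = 1), 3)
pct
```

For the data in the question, the equivalent call would presumably be table(where(ct), tt$ldclas), which shows which ldclas levels land in each leaf.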

I hope that was what you needed.

David Arenburg