I am trying to step through some R code in order to understand it. The statements are as follows:

library(rpart)
X = read.csv("strange_binary.csv");
fit  = rpart(c ~ X + X.1 + X.2 + X.3 + X.4 + X.5 + X.6 + X.7 + X.8 + X.9, method ="class",data=X,minbucket=1,cp=.04);
printcp(fit);
fit = prune(fit,cp=.04);

pred = predict(fit,X[,1:10],type="vector")      # test the classifier on the training data
pred[pred == 2] = "bad"
pred[pred == 1] = "good"

The aim is to build a classifier and to test it on the training data. However, I do not understand the statements:

pred[pred == 2] = "bad"
pred[pred == 1] = "good"

pred==2 and pred==1 would each be either TRUE or FALSE, so how are they being used to index a vector? Sorry for the naive question; I come from a C++ background and am taking baby steps in R.

Thanks for your help!

  • You can use a logical vector to select array elements. Type `?"["` for help. – G5W Apr 13 '17 at 00:25
  • Try something like `x = c("a", "b", "c", "d"); x[c(FALSE, TRUE, TRUE, FALSE)]`. Logical indexing/subsetting like this is very common in R. – Marius Apr 13 '17 at 00:28
  • `pred` is a vector holding the result of `predict`. So it looks like the model predicts each result as either 1 or 2, and those statements just change the results to the character strings "good" and "bad" respectively. – Andrew Lavers Apr 13 '17 at 00:28
  • @epi99, could you please elaborate? I am using the standard function `predict()` for prediction. How does this limit the predicted values to `1` and `2` then? The `strange_binary.csv` file has values `0` and `1`. –  Apr 13 '17 at 00:31
  • Were you expecting different output from `predict()`? `predict` is a generic function that acts differently depending on the type of model fit you pass to it, and the `type` argument you give. I assume that for your model, getting predicted classes, i.e. only `1` and `2`, makes sense. If you want some other type of prediction, you should explore the `type` options for your specific model. – Marius Apr 13 '17 at 00:36
  • @user6490375, my understanding is that `predict` is a generic function that can be applied to different classes. `rpart` returns an object which knows how to do the prediction, so the result is really determined by the specific model (rpart) and how it is set up. I don't know much about rpart specifically. – Andrew Lavers Apr 13 '17 at 00:37
  • @Marius, the `strange_binary.csv` file has 0s and 1s. What I am wondering is: how are the predicted values 1 and 2? –  Apr 13 '17 at 00:41

1 Answer

This is a way of saying: assign the value "bad" to the elements of pred that are equal to 2:

pred[pred == 2] = "bad"

Likewise, assign the value "good" to the elements of pred that are equal to 1:

pred[pred == 1] = "good"

A more R-like way of assigning values would look like this:

pred[pred == 2] <- "bad"
pred[pred == 1] <- "good"

So the two statements relabel the numeric predictions: every element of pred equal to 2 becomes "bad", and every element equal to 1 becomes "good".
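The two steps hidden inside `pred[pred == 2] = "bad"` can be seen on a small stand-in vector. This is a minimal sketch: the values in `pred` below are made up for illustration and are not output from the model in the question, and `is_bad` is a hypothetical helper name.

```r
# Hypothetical stand-in for the numeric predictions (not real model output)
pred <- c(1, 2, 2, 1)

# Step 1: the comparison is vectorised and yields one TRUE/FALSE per element
is_bad <- pred == 2
print(is_bad)

# Step 2: a logical vector used as an index selects the TRUE positions;
# assigning to that selection replaces only those elements
pred[is_bad] <- "bad"

# An atomic vector holds a single type, so the remaining numbers were
# coerced to character: pred is now c("1", "bad", "bad", "1")
print(pred)

# The second comparison still works because 1 is coerced to "1" first
pred[pred == 1] <- "good"
print(pred)
```

Coming from C++, each statement behaves roughly like a loop over the vector with an `if` inside, except that the comparison and the assignment are each applied to the whole vector at once.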

EDIT:

Because you also asked in the comments what pred is: I would recommend executing your code one line at a time. At each stage you can see what has changed by using str() to inspect the structure of the new variable. It gives the dimensions and types of the data, along with a few example values.

str(fit)
str(pred)

It will help you get a feel for what is occurring at each step.
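On the question from the comments of why the predictions are 1 and 2 when the file contains 0 and 1: with `method = "class"`, rpart treats the response as a factor, and the rpart documentation describes `type = "vector"` for classification trees as returning the predicted class as a number, i.e. its position among the factor levels. The base-R sketch below shows that mapping on a made-up response column; `c_values`, `codes` and `labels` are hypothetical names, and no model is fitted here.

```r
# Hypothetical response values standing in for column `c` of the CSV
c_values <- c(0, 1, 1, 0)

# Treated as a factor, the distinct values become the levels "0" and "1"
f <- factor(c_values)
print(levels(f))

# The underlying integer codes are positions in levels(f), so they start at 1
codes <- as.integer(f)
print(codes)

# Indexing a label vector by code relabels everything in a single step;
# the order "good", "bad" mirrors the question (1 -> good, 2 -> bad)
labels <- c("good", "bad")[codes]
print(labels)
```

If class labels rather than codes are what you want, `predict(fit, X, type = "class")` should return them directly as a factor, which may make the manual recoding unnecessary.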

ikop
sconfluentus
  • I don't think using `<-` for assignment is "more R-like". It's mostly personal preference, and it makes no difference in this case. – Marius Apr 13 '17 at 00:33
  • What exactly is `pred`? I mean to ask, which column (or row) of `pred` will be affected? And why on that one? –  Apr 13 '17 at 00:34
  • I believe that most style guides still recommend `<-` for assignment. It may make no difference in most cases, but there are cases where it does. See http://stackoverflow.com/questions/1741820/assignment-operators-in-r-and. – neilfws Apr 13 '17 at 00:38
  • `pred` is the variable to which you assigned the result of `pred = predict(fit,X[,1:10],type="vector")`, that is, the predictions from the model fitted in the preceding lines. @marius, either way of assigning is completely acceptable and equivalent here, and thus will work; it is just a convention of assignment in R. It was not the major point of my explanation, just something I thought someone new to R might need if they were not used to it, since they will likely find it in other explanations. – sconfluentus Apr 13 '17 at 00:39
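One case where `=` and `<-` do behave differently, as hinted at in the comments, is inside a function call. The sketch below uses a hypothetical function `square` and hypothetical variable names; it is only an illustration of that one difference, not a statement about the code in the question, where both operators do the same thing.

```r
# Hypothetical helper used only to illustrate the difference
square <- function(x) x^2

# Inside a call, `=` names an argument: 3 is passed as the parameter x,
# and no variable is created in the calling environment
result_named <- square(x = 3)

# Inside a call, `<-` is an ordinary assignment: y is created in the
# calling environment, and its value is then passed as the first argument
result_arrow <- square(y <- 3)

print(result_named)
print(result_arrow)
print(y)
```

At the top level of a script, as in `pred[pred == 2] = "bad"`, the two operators are interchangeable.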