I have a dataset and I would like to replace values in the dataset under some conditions.

set.seed(100)
Mydata=sample(-5:5,size = 1000,replace = T)
Mydata=as.data.frame(matrix(Mydata,nrow = 100))

Mydata[Mydata<=-1 & Mydata>-1.5] = "A"
Mydata[Mydata<=-1.5 & Mydata>-2] = "B"
Mydata[Mydata<=-2] = "C"
Mydata[Mydata>-1] = "D"

The result should be a dataframe filled with "A","B","C", and "D". However, when I run the code, the result is filled with just "D". I wonder what the problem is. Thanks.

thelatemail
Yang Yang
  • You could check `?cut` – akrun Dec 21 '16 at 04:02
  • Thanks. Could you explain why my code is wrong? Thanks a lot. – Yang Yang Dec 21 '16 at 04:04
  • It is caused by overwriting, as @thelatemail commented: after the earlier steps, every remaining value satisfies the condition in the last step. Also, with `cut`, something like `cut(x, breaks = c(-Inf, -2, -1.5, -1, Inf), labels = c("C", "B", "A", "D"))`, applied to each column `x` – akrun Dec 21 '16 at 04:06
  • As per @akrun's `cut` suggestion, try `data.frame(lapply(Mydata, cut, breaks=c(-Inf, -2, -1.5, -1, Inf), labels=c("C","B","A","D")))` – thelatemail Dec 21 '16 at 04:08
  • Possible duplicate of [Group numeric values by the intervals](http://stackoverflow.com/questions/13559076/group-numeric-values-by-the-intervals) – Ronak Shah Dec 21 '16 at 04:12
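
The `cut` suggestion from the comments can be sketched as a self-contained script. This is only an illustration under stated assumptions: `to_letter` and `Result` are hypothetical names introduced here, and the breaks and labels are the ones given in the comments.

```r
# Sketch of the cut() approach; `to_letter` and `Result` are illustrative names.
set.seed(100)
Mydata <- as.data.frame(matrix(sample(-5:5, size = 1000, replace = TRUE), nrow = 100))

# cut() intervals are right-closed by default:
# (-Inf, -2] -> "C", (-2, -1.5] -> "B", (-1.5, -1] -> "A", (-1, Inf) -> "D"
to_letter <- function(x) {
  as.character(cut(x, breaks = c(-Inf, -2, -1.5, -1, Inf),
                   labels = c("C", "B", "A", "D")))
}

# Assigning into Result[] replaces every column but keeps the data frame shape
Result <- Mydata
Result[] <- lapply(Mydata, to_letter)

# Count how often each label occurs
table(unlist(Result))
```

Because the sample contains only integers, no value falls in the "B" interval, so that label is expected to be absent from the table.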

2 Answers

The problem has to do with the fact that you're replacing numbers with characters. Vectors can only have elements of one class, so when you replace some of the elements with "A" in your first step, all of the columns with those elements are coerced to character vectors. Check it out:

> set.seed(100)
> Mydata=sample(-5:5,size = 50,replace = T)
> Mydata=as.data.frame(matrix(Mydata,nrow = 10))
> str(Mydata)
'data.frame':   10 obs. of  5 variables:
 $ V1: int  -2 -3 1 -5 0 0 3 -1 1 -4
 $ V2: int  1 4 -2 -1 3 2 -3 -2 -2 2
 $ V3: int  0 2 0 3 -1 -4 3 4 1 -2
 $ V4: int  0 5 -2 5 2 4 -4 1 5 -4
 $ V5: int  -2 4 3 4 1 0 3 4 -3 -2
> Mydata[Mydata<=-1 & Mydata>-1.5] = "A"
> str(Mydata)
'data.frame':   10 obs. of  5 variables:
 $ V1: chr  "-2" "-3" "1" "-5" ...
 $ V2: chr  "1" "4" "-2" "A" ...
 $ V3: chr  "0" "2" "0" "3" ...
 $ V4: int  0 5 -2 5 2 4 -4 1 5 -4
 $ V5: int  -2 4 3 4 1 0 3 4 -3 -2

Interestingly enough, it turns out R will allow you to compare characters with numbers in tests of (in)equality: the number is converted to a string and the two values are compared as strings. So when you apply the subsequent rules, R keeps replacing character values that satisfy the inequality rather than throwing a warning or error. For example:

> char_vec <- c("A", 1, 2, -1)
> char_vec
[1] "A"  "1"  "2"  "-1"
> char_vec > 0
[1]  TRUE  TRUE  TRUE FALSE

It turns out all upper case letters (and all lower case letters, for that matter) compare as greater than -1, so the whole data frame ends up getting replaced by D's in the last step.

> toupper(letters) > -1
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[19] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
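
A small sketch of why these comparisons succeed: when one side is a character value, R converts the number to a string and compares the two as strings. The variable names below are illustrative only.

```r
# The number 9 is converted to "9", so both lines perform a string comparison
num_vs_chr <- "10" < 9
chr_vs_chr <- "10" < "9"
identical(num_vs_chr, chr_vs_chr)   # TRUE: the two comparisons are equivalent

# As strings, "10" sorts before "9" because "1" sorts before "9"
chr_vs_chr                          # TRUE

# Converting back to numeric restores the arithmetic comparison
as.numeric("10") < 9                # FALSE
```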

The easiest way to prevent this behavior is by using ifelse, as pointed out by Aaghaz. Another option would be to create a new data frame rather than progressively overwriting the original, so that every condition is still evaluated on the original numeric values:

> Newdata <- Mydata
> Newdata[Mydata<=-1 & Mydata>-1.5] = "A"
> Newdata[Mydata<=-1.5 & Mydata>-2] = "B"
> Newdata[Mydata<=-2] = "C"
> Newdata[Mydata>-1] = "D"
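
For a check that runs in a fresh session, the same idea can be written as a standalone script. This is a sketch that assumes `Mydata` is still fully numeric when it is copied, which is why the data are regenerated first.

```r
# Regenerate the numeric data so no column has been coerced to character yet
set.seed(100)
Mydata <- as.data.frame(matrix(sample(-5:5, size = 1000, replace = TRUE), nrow = 100))

# Write labels into a copy; every condition is evaluated on the numeric original
Newdata <- Mydata
Newdata[Mydata <= -1 & Mydata > -1.5] <- "A"
Newdata[Mydata <= -1.5 & Mydata > -2] <- "B"
Newdata[Mydata <= -2] <- "C"
Newdata[Mydata > -1] <- "D"

# Mydata itself is unchanged, and Newdata now holds more than one label
str(Mydata[1:3])
table(unlist(Newdata))
```
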
Rose Hartman

You can use ifelse:

ifelse(Mydata <= -1 & Mydata > -1.5, "A",
       ifelse(Mydata <= -1.5 & Mydata > -2, "B",
              ifelse(Mydata <= -2, "C", "D")))
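
One detail worth checking with the nested `ifelse` call above: the comparisons on a data frame yield a logical matrix, so the result is a character matrix rather than a data frame. The sketch below regenerates the data and converts the result back; `res` and `res_df` are illustrative names.

```r
set.seed(100)
Mydata <- as.data.frame(matrix(sample(-5:5, size = 1000, replace = TRUE), nrow = 100))

# ifelse() returns an object shaped like its test, here a 100 x 10 matrix
res <- ifelse(Mydata <= -1 & Mydata > -1.5, "A",
       ifelse(Mydata <= -1.5 & Mydata > -2, "B",
       ifelse(Mydata <= -2, "C", "D")))

is.matrix(res)   # TRUE

# Convert back if a data frame is needed
res_df <- as.data.frame(res)
```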

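Or use if_else from the dplyr package, a vectorised if that is stricter (it checks that the true and false values have the same type) and faster than base ifelse. Since if_else is documented for a logical vector condition, it is applied column by column here:

library(dplyr)  # if_else() is provided by dplyr
# apply the rules to each column x, then rebuild the data frame
data.frame(lapply(Mydata, function(x)
  if_else(x <= -1 & x > -1.5, "A",
    if_else(x <= -1.5 & x > -2, "B",
      if_else(x <= -2, "C", "D")))))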
Agaz Wani
  • Yes, your code works. Could you tell me why my code is wrong? Thanks. – Yang Yang Dec 21 '16 at 04:00
  • I suspect it's because you keep overwriting the original `Mydata` at each step, causing the comparisons to no longer make sense. – thelatemail Dec 21 '16 at 04:01
  • I think it is because after this line `Mydata[Mydata<=-1 & Mydata>-1.5] = "A"` the type of each affected column is changed from int to chr. Try running `str(Mydata)` before and after. The values are then compared as strings, and `"-5" > "-1"` is TRUE, as are the comparisons for `"0"`, `"5"`, `"A"`, `"B"`, `"C"`... – Mist Dec 21 '16 at 04:05
  • @thelatemail Thank you so much. After the first step, `Mydata` has changed and contains both chr and int columns. – Yang Yang Dec 21 '16 at 04:08