
I was performing some simple subsetting on my data when I tried to remove all rows with a negative value.

Example code:

df1 <- df1[df1$var2 >= 0,]

In my current dataset this action should remove four rows. However, no rows are removed after executing.

Strangely, the following code leaves only the four rows that the first line should have removed:

 df1 <- df1[df1$var2 < 0,]

I found that using subset(df1, var2 >= 0) does work and removes the four rows that I want to have removed. But I always thought that the first code was the same as using subset()? Does someone know why the very first code doesn't work the way I intend it to?

Edit, including some data from my dataset:

> dput(df1[1:10,215:220])
structure(list(KBUY_CER = c(3L, 0L, 0L, 0L, 0L, 3L, 2L, 0L, 3L, 
2L), KBUY_PRO = c(1L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L), KBUY_DEF = c(1L, 
0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), THINK_SP = c(NA, 1L, NA, 
NA, NA, NA, NA, 1L, 1L, NA), dwifexp = c(NA, 0L, NA, NA, NA, 
NA, NA, 0L, 0L, NA), dwifdoll = c(NA, 1500L, NA, NA, NA, NA, 
NA, 600L, 600L, NA)), row.names = c("389", "390", "391", "392", 
"393", "394", "395", "396", "397", "398"), class = "data.frame")

All columns I've selected data from work as intended; only column 215 (KBUY_CER) seems to have this problem.

I just found out that the four rows that should be removed are not negative but NA, which explains why neither `[` call removes them from the selection. But does subset() always remove NA values, then?

xiVoiix
  • Difficult to tell without a reproducible example but that could happen if you have floating point values in `var2`. Read https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal and https://stackoverflow.com/questions/588004/is-floating-point-math-broken – Ronak Shah Apr 05 '20 at 13:14
  • Is `var2` numeric? Are you resetting `df1` after running these lines that modify it? Please share some data reproducibly, like `dput(df1[1:10, ])`. – Gregor Thomas Apr 05 '20 at 13:18
  • Thanks, and then subset avoids the problem with floating point values @RonakShah? – xiVoiix Apr 05 '20 at 13:24
  • No - `subset` and `[` behave exactly the same regarding floating point values... and floating point issues generally come about when comparing to decimal numbers, comparisons with 0 should be just fine. I doubt it's relevant here. What would help a lot is a reproducible example with some sample data, e.g., `dput(df1[1:10, ])`. Then we can really see what's going on. – Gregor Thomas Apr 06 '20 at 13:59
  • @GregorThomas I edited my original post with a part of the dataset. Hope it helps. – xiVoiix Apr 06 '20 at 23:40
  • Yes, the `[` and `subset` behavior is different for missing values. When you put `NA` inside `[`, the result is a row of `NA`s, whereas `subset` treats `NA` in the condition as `FALSE` and drops the row. Many people use `which` with `[` to make it behave like `subset`, e.g., `df1[which(df1$THINK_SP > 0), ]` – Gregor Thomas Apr 07 '20 at 01:48
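
To make the difference in the last comment concrete, here is a minimal sketch built from three of the columns in the `dput()` output above. The names `df`, `keep`, `by_bracket`, `by_subset`, `by_which` and `by_isna` are made up for illustration, and `THINK_SP` is used as the test column only because it contains `NA`s in the posted sample.

```r
# Three of the columns from the dput() output in the question
df <- data.frame(
  KBUY_CER = c(3L, 0L, 0L, 0L, 0L, 3L, 2L, 0L, 3L, 2L),
  THINK_SP = c(NA, 1L, NA, NA, NA, NA, NA, 1L, 1L, NA),
  dwifdoll = c(NA, 1500L, NA, NA, NA, NA, NA, 600L, 600L, NA),
  row.names = as.character(389:398)
)

# Comparing NA with a number gives NA, not TRUE or FALSE
keep <- df$THINK_SP >= 0

# `[` returns an all-NA row for every NA in the logical index,
# so the row count does not drop
by_bracket <- df[keep, ]
nrow(by_bracket)

# subset() treats NA in the condition as FALSE and drops those rows
by_subset <- subset(df, THINK_SP >= 0)
nrow(by_subset)

# which() returns only the positions that are TRUE, so `[` then
# behaves like subset()
by_which <- df[which(keep), ]

# Equivalent explicit form without which()
by_isna <- df[!is.na(keep) & keep, ]
```

In this sample `by_bracket` still has ten rows, seven of them filled with `NA`, while the other three approaches keep only rows "390", "396" and "397". That matches what the question describes: the row count stays the same after `[`, and the "removed" rows come back as `NA` rows.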
