In base R I can subset a data.frame based on a row range:

mtcars[1:5,]

Or I can subset based on a logical condition:

mtcars[mtcars$cyl==6,]

But I don't appear to be able to do both:

mtcars[1:5 & mtcars$cyl==6,]

Warning message:
In 1:5 & mtcars$cyl == 6 :
  longer object length is not a multiple of shorter object length

Is there another way to do this?

The use case is loading a huge .csv file with the LaF package, which supports filtering with commands similar to base R. It loads data much more quickly with row ranges than with conditions, and with more than one condition I sometimes have to wait a day for the data to load.

John Clegg

2 Answers

If you are working interactively, I would use subset:

subset(mtcars[1:5,], cyl==6)
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

Or store the intermediate result:

tt <- mtcars[1:5,]
tt[tt$cyl==6,]
rm(tt)

Alternatively, you can chain your two conditions:

mtcars[(1:5)[mtcars$cyl[1:5]==6],]
#mtcars[1:5,][mtcars$cyl[1:5]==6,] #Alternative
#mtcars[1:5,][mtcars[1:5,]$cyl==6,] #Alternative

Or store 1:5 in a variable, which is what I would recommend in this case:

i <- 1:5
mtcars[i[mtcars$cyl[i]==6],]
rm(i)
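
As a quick sanity check, the variants above can be compared directly. This is only a sketch on the built-in mtcars data; the names by_subset, by_temp and by_index are made up for illustration, and nothing here has been tried against a LaF object.

```r
# Three of the approaches from this answer, applied to the built-in mtcars data.
i <- 1:5

by_subset <- subset(mtcars[i, ], cyl == 6)    # subset() on the row range
tt <- mtcars[i, ]                             # intermediate copy of the row range
by_temp <- tt[tt$cyl == 6, ]
by_index <- mtcars[i[mtcars$cyl[i] == 6], ]   # filter the row numbers, then index once

# All three select the same rows.
rownames(by_index)
stopifnot(
  identical(rownames(by_subset), rownames(by_index)),
  identical(rownames(by_temp), rownames(by_index))
)
```

Only the last form avoids building mtcars[i, ] before filtering: it reads the cyl column within the range and then picks the matching row numbers.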
GKi
  • These are great answers, but for this use case, which involves reading a very large file, I can't copy the data. I'm not sure if LaF accepts the subset commands but will check it out. – John Clegg Sep 07 '20 at 15:39
  • Maybe try `i <- 1:5; mtcars[i[mtcars$cyl[i]==6],]` – GKi Sep 08 '20 at 06:08

You can do the subsetting in either of these ways.

  1. Based on a logical vector:
mtcars[seq(nrow(mtcars)) %in% 1:5 & mtcars$cyl==6,]

#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
  2. Based on a row range:
mtcars[intersect(1:5, which(mtcars$cyl==6)),]
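
For context on why the attempt in the question does not work, here is a small sketch on mtcars (the names cond, naive, in_range and fixed are illustrative). In 1:5 & mtcars$cyl==6 the integers 1:5 are coerced to logical, so every element becomes TRUE, and that vector is then recycled along the 32 rows. The row range is effectively ignored. Wrapping the range in %in% first turns it into a full-length logical vector.

```r
cond <- mtcars$cyl == 6

# Non-zero integers coerce to TRUE, so the range carries no row information here.
as.logical(1:5)

# The recycled all-TRUE vector leaves the condition unchanged (hence the warning).
naive <- suppressWarnings(1:5 & cond)
identical(naive, cond)

# A full-length logical vector marking rows 1 to 5 combines correctly with `&`.
in_range <- seq_len(nrow(mtcars)) %in% 1:5
fixed <- in_range & cond
which(fixed)
```

So the original expression returns every six-cylinder row in the data, not just those among the first five.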
Ronak Shah
  • Do you have any sense which of these might be faster? The trick is to find a way to get LaF to only search within the row range and never have to count the full dataset (which is 30 GB). I worry that nrow() might force it to do that. – John Clegg Sep 07 '20 at 15:41
  • Yes, in the first option we are generating a sequence from 1 to `nrow`, which might not be efficient. You can use the `intersect` method. Even better would be `mtcars[intersect(1:5, which(mtcars$cyl[1:5]==6)),]` – Ronak Shah Sep 07 '20 at 23:46
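
One caveat on the last suggestion, sketched below on mtcars: which() applied to a slice returns positions within that slice, so intersecting them with the range only lines up when the range starts at row 1. For a range such as 6:11 (chosen purely for illustration, as are the names rng, pos, wrong and rows), the positions need to be mapped back through the range. Whether the same indexing carries over to a LaF object is an assumption that has not been checked here.

```r
rng <- 6:11                          # illustrative range that does not start at 1

# Positions are relative to the slice, not to the full data.
pos <- which(mtcars$cyl[rng] == 6)

# Intersecting slice positions with absolute row numbers loses matches.
wrong <- intersect(rng, pos)

# Mapping positions back through the range gives absolute row numbers.
rows <- rng[pos]

mtcars[rows, ]
```

With 1:5 the two coincide, which is why the one-liner in the comment gives the right answer for the example in the question.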