Search for data.frame indices according to multiple column criteria - how to speed up for large data sets?

Question

I have a large data.frame (50M rows) and need to find indices which match a multiple column criteria.

I made a toy example to illustrate the problem.

I read here that it's faster to search though data.tables than data.frames so I converted data.frame into a data.table object

cases<-c(1,3,5)
women<-c("Julia", "Judith", "Juno", "Jane", "Joanna")
data.df<-data.frame("id" = 1:5, "age" = c(20, 30, 40, 50, 60), "name" = c("Joanna","Joe", "Julia", "Juno", "John"))

library(data.table)
data.dt<-as.data.table(data.df)
setkey(data.dt, "id")

I want the results vector to contain age values of records which pass the multiple column criteria, in this case 20, 40, NA. I search using the for loop (which is probably a silly approach, any tips would be greatly appreciated here)

results<-vector()
for (i in 1:length(cases)){
which_id<-cases[i]
ind<-data.dt[id==which_id & name %in% women, which=TRUE]
if(length(ind)==0){results[i]<-NA}
else{results[i]<-data.dt$age[ind]}
}

This would do on a smaller data set but with 500K records in cases and 50M in data.df it is going to take an excess of 12 hours to run. There must be a simpler way, can anyone give a hint?

please see this vignette https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html — MichaelChirico, Dec 06 '19 at 03:38
After reading the vignette I see that I was using data.tables same way as if I was using data.frame — Pipps Blitz, Dec 06 '19 at 13:51

Uwe · Accepted Answer · 2019-12-06T11:51:27.773

If I understand correctly, the OP wants to filter his dataset by id first and then to return age if the name is found in women (and NA otherwise).

Below are different data.table approaches which return the expected result

20 40 NA

for the example case. However, performance may vary for the production dataset.

1. Filtering by `id`, matching on `women`

setkey(data.dt, id)
data.dt[cases][name %in% women, Age := age][, Age]

Here, integer matching is used as name has been turned into factor by OP's call to data.frame(). (If name is of type character %chin% could have been used).

To make sure NA is returned in case of no matches, an update in place is used which puts NA by default.

Note that Cole's approach data.dt[J(cases)][name %in% women] would only return a filtered dataset with 2 rows but not the expected result.

2. Filtering by `id`, joining with `women`

This is similar to above but uses a join instead of a match:

setkey(data.dt, id)
data.dt[cases][.(women), on = .(name = V1), Age := age][, Age]

3. Joining with `women`, then joining with `cases`

This approach selects the rows where name matches women first (by joining), then it right joins with cases so that each case will have its corresponding entry in the result vector:

setkey(data.dt, id)
data.dt[.(women), on = .(name = V1), nomatch = 0L][cases, age]

Discussion

The OP has pointed out that speed is a concern with a production dataset of 500k elements in cases and 50M rows in data.df. Which of above methods is the fastest for the production dataset may also depend on the number of entries in women.

Without proper benchmarking I am reluctant to recommend one of the methods.

Hi, thank you for a comprehensive answer. I tried the first approach and it works (how fast it is!) but will check with the other two methods and see which one is the fastest. — Pipps Blitz, Dec 06 '19 at 13:48

score 0 · Answer 2 · answered Dec 06 '19 at 06:58

0

I assume that you expect ind to be either of length 1 or 0 (meaning that the IDs are all different).

Then, you can use {dplyr} to do that very fast:

library(dplyr)
results2 <- data.df %>%
  slice(match(cases, id)) %>%
  mutate(res = ifelse(name %in% women, age, NA)) %>%
  pull(res)

answered Dec 06 '19 at 06:58

F. Privé

11,423
2
27
78

Thanks for pointing this out! You are right that it'd be better to leave results as a list. Your solution should work for what I am trying to do, ie. determining the pairs of people who become married in each year but in reality there will be these curious cases with more than 2 people matching the criteria and hence ind greater than 1. – Pipps Blitz Dec 06 '19 at 13:46

score 0 · Answer 3 · answered Dec 06 '19 at 09:53

0

Maybe you can try the following code

(v<-data.df[cases,])$age[match(v$name,women)>0]

or

(v<-data.dt[cases,])$age[match(v$name,women)>0]

answered Dec 06 '19 at 09:53

ThomasIsCoding

96,636
9
24
81

Search for data.frame indices according to multiple column criteria - how to speed up for large data sets?

3 Answers3

1. Filtering by id, matching on women

2. Filtering by id, joining with women

3. Joining with women, then joining with cases

Discussion

1. Filtering by `id`, matching on `women`

2. Filtering by `id`, joining with `women`

3. Joining with `women`, then joining with `cases`