0

I have a large data.frame (50M rows) and need to find indices which match a multiple column criteria.

I made a toy example to illustrate the problem.

I read here that it's faster to search though data.tables than data.frames so I converted data.frame into a data.table object

cases<-c(1,3,5)
women<-c("Julia", "Judith", "Juno", "Jane", "Joanna")
data.df<-data.frame("id" = 1:5, "age" = c(20, 30, 40, 50, 60), "name" = c("Joanna","Joe", "Julia", "Juno", "John"))

library(data.table)
data.dt<-as.data.table(data.df)
setkey(data.dt, "id")

I want the results vector to contain age values of records which pass the multiple column criteria, in this case 20, 40, NA. I search using the for loop (which is probably a silly approach, any tips would be greatly appreciated here)

results<-vector()
for (i in 1:length(cases)){
which_id<-cases[i]
ind<-data.dt[id==which_id & name %in% women, which=TRUE]
if(length(ind)==0){results[i]<-NA}
else{results[i]<-data.dt$age[ind]}
}

This would do on a smaller data set but with 500K records in cases and 50M in data.df it is going to take an excess of 12 hours to run. There must be a simpler way, can anyone give a hint?

3 Answers3

1

If I understand correctly, the OP wants to filter his dataset by id first and then to return age if the name is found in women (and NA otherwise).

Below are different data.table approaches which return the expected result

20 40 NA

for the example case. However, performance may vary for the production dataset.

1. Filtering by id, matching on women

setkey(data.dt, id)
data.dt[cases][name %in% women, Age := age][, Age]

Here, integer matching is used as name has been turned into factor by OP's call to data.frame(). (If name is of type character %chin% could have been used).

To make sure NA is returned in case of no matches, an update in place is used which puts NA by default.

Note that Cole's approach data.dt[J(cases)][name %in% women] would only return a filtered dataset with 2 rows but not the expected result.

2. Filtering by id, joining with women

This is similar to above but uses a join instead of a match:

setkey(data.dt, id)
data.dt[cases][.(women), on = .(name = V1), Age := age][, Age]

3. Joining with women, then joining with cases

This approach selects the rows where name matches women first (by joining), then it right joins with cases so that each case will have its corresponding entry in the result vector:

setkey(data.dt, id)
data.dt[.(women), on = .(name = V1), nomatch = 0L][cases, age]

Discussion

The OP has pointed out that speed is a concern with a production dataset of 500k elements in cases and 50M rows in data.df. Which of above methods is the fastest for the production dataset may also depend on the number of entries in women.

Without proper benchmarking I am reluctant to recommend one of the methods.

Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Hi, thank you for a comprehensive answer. I tried the first approach and it works (how fast it is!) but will check with the other two methods and see which one is the fastest. – Pipps Blitz Dec 06 '19 at 13:48
0

I assume that you expect ind to be either of length 1 or 0 (meaning that the IDs are all different).

Then, you can use {dplyr} to do that very fast:

library(dplyr)
results2 <- data.df %>%
  slice(match(cases, id)) %>%
  mutate(res = ifelse(name %in% women, age, NA)) %>%
  pull(res)
F. Privé
  • 11,423
  • 2
  • 27
  • 78
  • Thanks for pointing this out! You are right that it'd be better to leave results as a list. Your solution should work for what I am trying to do, ie. determining the pairs of people who become married in each year but in reality there will be these curious cases with more than 2 people matching the criteria and hence ind greater than 1. – Pipps Blitz Dec 06 '19 at 13:46
0

Maybe you can try the following code

(v<-data.df[cases,])$age[match(v$name,women)>0]

or

(v<-data.dt[cases,])$age[match(v$name,women)>0]
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81