
All I want to do is a simple conditional average (just like the AVERAGEIF function in Excel). I am working with data.tables for efficiency, as I have rather large tables (~1m rows).

My aim is to look up, for each row of table 1, the matching rows in table 2 and average their alpha values.

Table 1 
| individual id | date        |
-------------------------------
| 1             |  2018-01-02 |
| 1             |  2018-01-03 |
| 2             |  2018-01-02 |
| 2             |  2018-01-03 |

Table 2 
| individual id | date2       | alpha |
---------------------------------------
| 1             |  2018-01-02 |  1    |  
| 1             |  2018-01-04 |  1.5  |
| 1             |  2018-01-05 |  1    |
| 2             |  2018-01-01 |  2    |  
| 2             |  2018-01-02 |  1    |
| 2             |  2018-01-05 |  4    |

Target result

Updated table 1
| individual id | date        | mean(alpha) |
---------------------------------------------
| 1             |  2018-01-02 |  1          |
| 1             |  2018-01-03 |  1          |
| 2             |  2018-01-02 | 1.5         |
| 2             |  2018-01-03 | 1.5         |

This is simply the mean of all the values for this individual in table2 that occurred (date2) prior to (and including) the date. The result can be produced by the following MySQL command, but I am unable to reproduce it in R.

update table1
set daily_alpha_avg =
    (select avg(case when date2 <= table1.date then alpha end)
     from table2
     where table2.individual_id = table1.individual_id);

My best guess so far is:

table1[table2, on = .(individual_id, date >= date2),
       .(x.individual_id, x.date, bb = mean(alpha)), by = .(x.date, x.individual_id)]

or

table1[, daily_alpha_avg := table2[table1, mean(alpha), on =.(individual_id, date>=date2)]]

but this isn't working. I know it's wrong, I just don't know how to fix it.

Thanks for any help

Laurence_jj

3 Answers


Using by = .EACHI you could do something like the following:

table2[table1, 
       on = .(`individual id`), 
       .(date = i.date, mean_alpha = mean(alpha[date2 <= i.date])),
       by = .EACHI]

#    individual id       date mean_alpha
# 1:             1 2018-01-02        1.0
# 2:             1 2018-01-03        1.0
# 3:             2 2018-01-02        1.5
# 4:             2 2018-01-03        1.5

Edit:

# Assign by reference as a new column
table1[, mean_alpha := table2[table1, 
                              on = .(`individual id`), 
                              mean(alpha[date2 <= i.date]),
                              by = .EACHI][["V1"]]]

Edit 2:

Here is a slightly more elegant way, suggested by Frank in the comment section.

# In this solution our date columns can't be type character
table1[, date := as.Date(date)]
table2[, date2 := as.Date(date2)]

table1[, mean_alpha := table2[table1, # or equivalently .SD instead of table1
                              on = .(`individual id`, date2 <= date), 
                              mean(alpha), 
                              by = .EACHI][["V1"]]]

Reproducible data

table1 <- fread(
  "individual id | date       
   1             |  2018-01-02
   1             |  2018-01-03
   2             |  2018-01-02
   2             |  2018-01-03", 
  sep ="|"
)
table2 <- fread(
  "individual id | date2       | alpha
   1             |  2018-01-02 |  1     
   1             |  2018-01-04 |  1.5 
   1             |  2018-01-05 |  1   
   2             |  2018-01-01 |  2     
   2             |  2018-01-02 |  1   
   2             |  2018-01-05 |  4",
  sep = "|"
)
s_baldur
  • your answer is exactly what I am looking for!! The only issue is that I am getting a new table that is not mapped onto my table1. Ideally I was hoping to create a new column on my table1 with the results, and NA or zero if no data is available. I am also getting only NAs in the output column at the moment, but that might be my data – Laurence_jj Apr 16 '19 at 13:24
  • @Laurence_jj Added an alternative solution that maps the calculations to a new column; it should give you `NA` if no match... – s_baldur Apr 16 '19 at 13:32
  • may I ask what the `[["V1"]]` does? – Laurence_jj Apr 16 '19 at 13:43
  • @Laurence_jj it's just base R syntax to extract one column from a data.frame. `table2[table1, on = .("individual id"), mean(alpha[date2 <= i.date]), by = .EACHI]` will return a `data.table`, but I just want that one column. – s_baldur Apr 16 '19 at 13:44 (see the sketch after these comments)
  • Last thing: when I try to replace the mean with a count command (to count the instances) I get the error `Error in UseMethod("groups") : no applicable method for 'groups' applied to an object of class "c('double', 'numeric')"`. Do you know how to fix it? – Laurence_jj Apr 16 '19 at 14:04
  • @Laurence_jj what did you try? Using `sum(date2 <= i.date)` instead of `mean(alpha[date2 <= i.date])` I get the number of matches. – s_baldur Apr 16 '19 at 14:09
  • I tried using `count(alpha[date2 <= i.date])` but it errored. You are quite correct, and using `sum(date2 <= i.date)` works perfectly. Thanks again so much!! – Laurence_jj Apr 16 '19 at 14:13
  • If you overwrite-convert the dates with `as.IDate` or `as.Date`, then `table1[, v := table2[.SD, on=.(\`individual id\`, date2 <= date), mean(alpha), by=.EACHI]$V1]` also works. Btw, great answer :) – Frank Apr 16 '19 at 18:17
  • many thanks for the update @Frank. By the way, I have a follow-up question with a similar setup if you are interested: https://stackoverflow.com/questions/55723046/r-data-table-if-then-sumif-lookup-using-join – Laurence_jj Apr 17 '19 at 08:26
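
To make the two comment threads above concrete, here is a small sketch (using the character-date tables from the "Reproducible data" block above) of what the intermediate join returns, why `[["V1"]]` is used, and what the counting variant looks like. The names `tmp` and `n_matches` are purely illustrative:

library(data.table)

# The grouped join returns one row per row of table1 (because of by = .EACHI);
# the unnamed j expression ends up in a column called V1
tmp <- table2[table1,
              on = .(`individual id`),
              mean(alpha[date2 <= i.date]),
              by = .EACHI]
tmp            # a data.table with columns `individual id` and V1
tmp[["V1"]]    # plain base-R extraction of that single column as a vector

# Counting the matching rows instead of averaging them:
table1[, n_matches := table2[table1,
                             on = .(`individual id`),
                             sum(date2 <= i.date),
                             by = .EACHI][["V1"]]]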

Is tidyverse performance not enough for you?

I couldn't replicate your table with just `date2 < date`, so I added `=`.

# Reproducible data (the tidyverse is needed for tribble(), left_join(), etc.)
library(tidyverse)

table1 <- tribble(~individual_id,~date,
                  1,"2018-01-02",
                  1,"2018-01-03",
                  2,"2018-01-02",
                  2,"2018-01-03")

table2 <- tribble(~individual_id,~date2,~alpha,
                  1,"2018-01-02",1,
                  1,"2018-01-04",1.5,
                  1,"2018-01-05",1,
                  2,"2018-01-01",2,
                  2,"2018-01-02",1,
                  2,"2018-01-05",4)

df <- left_join(table1,table2) %>%
  mutate(date = as.Date(date),
         date2 = as.Date(date2))

df %>% 
  group_by(individual_id,date) %>% 
  mutate(case = ifelse(date2<=date,alpha,NA)) %>% 
  summarise(mean_alpha = mean(case,na.rm = TRUE))

You also have the option of using the tidyverse to generate SQL queries; dbplyr provides SQL translations (see https://dbplyr.tidyverse.org/articles/sql-translation.html), and the `show_query()` function lets you check that you are using the same logic in SQL and in R.
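
For illustration, here is a minimal sketch of `show_query()` using an in-memory SQLite database through DBI/RSQLite (the grouped mean is just a stand-in for whatever logic you want to translate, and the `individual_id`-style tables from the tribble example above are assumed):

library(dplyr)
library(dbplyr)

# Copy the local tibble into a throwaway SQLite database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, table2, "table2")

tbl(con, "table2") %>%
  group_by(individual_id) %>%
  summarise(mean_alpha = mean(alpha, na.rm = TRUE)) %>%
  show_query()   # prints the SELECT ... GROUP BY statement dbplyr generates

DBI::dbDisconnect(con)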

Bruno

Just use the sqldf package and put your query inside `sqldf()`.

library(sqldf)
sqldf("your SQL goes here")
table1

That's it
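
For reference, here is a minimal sketch of how that could look for this question. Note that sqldf runs queries against temporary copies of your data frames (in SQLite by default), so an UPDATE would not modify `table1` in your R session; writing the lookup as a SELECT is the more natural fit. The underscore-style column names (`individual_id`) and ISO-formatted character dates from the tribble example above are assumed (ISO dates compare correctly as strings):

library(sqldf)

result <- sqldf("
  select t1.individual_id,
         t1.date,
         avg(t2.alpha) as mean_alpha
  from table1 t1
  left join table2 t2
    on  t2.individual_id = t1.individual_id
   and  t2.date2 <= t1.date
  group by t1.individual_id, t1.date
")
result   # mean_alpha is NA for individuals with no earlier observations in table2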

Sayandip Ghatak