0

This is the followup question of my question here:

This is my transaction data

data 

id          from    to          date        amount  
<int>       <fctr>  <fctr>      <date>      <dbl>
19521       6644    6934        2005-01-01  700.0
19524       6753    8456        2005-01-01  600.0
19523       9242    9333        2005-01-01  1000.0
…           …       …           …           …
1055597     9866    9736        2010-12-31  278.9
1053519     9868    8644        2010-12-31  242.8
1052790     9869    8399        2010-12-31  372.2

Now, for each account in the from column, I want to compute how much transaction amount they received over last 6 month at the time the transaction was made. To do this:

df <- data # df is just a copy of "data"
setDT(df)[, total_trx_amount_received_in_last_6month := sapply(date, function(x) 
                         sum(amount[between(date, x-180, x)])), to] 

# since I want to merge "df" and "data" based on the columns "from" and "date", I change the name of the column "to" and make it "from"
df <- select(df, to,date,total_trx_amount_received_in_last_6month) %>% rename(from=to)

df

from    date        total_trx_amount_received_in_last_6month
<fctr>  <date>      <dbl>
7468    2005-01-04  700.0       
6213    2005-01-08  12032.0     
7517    2005-01-10  1000.0      
6143    2005-01-12  4976.0      
6254    2005-01-14  200.0       
6669    2005-01-20  200.0       
6934    2005-01-24  72160.0     
9240    2005-01-26  21061.0     
6374    2005-01-30  1000.0      
6143    2005-01-31  4989.4  

Now I want to add this new column total_trx_amount_received_in_last_6month into the original data. So, I should merge these two dataframes data and df by the columns from and date but the matching criteria for date is a range of values, not a single value. For example for account 7468 if the original data contains a transaction 7468 made and the transaction date falls into interval of "2004-07-08"-"2005-01-04"(which is the period of last 6 months, starting from "2005-01-04"), then corresponding value 700.0 in df$total_trx_amount_received_in_last_6month should be added to the data$total_trx_amount_received_in_last_6month

How can I do that?

Leyla Alkan
  • 355
  • 3
  • 12

1 Answers1

0

Don't have enough data to test this but you can join the two dataframes and replace total_trx_amount_received_in_last_6month where the difference between the two dates is greater than 180 days to NA.

library(dplyr)

data %>%
left_join(df, by = 'from') %>%
  mutate(total_trx_amount_received_in_last_6month = replace(
            total_trx_amount_received_in_last_6month, 
            (date.y - date.x) > 180, NA))

With data.table, you can do :

library(data.table)
setDT(data)
df1 <- df[data, on = 'from']

df1[, total_trx_amount_received_in_last_6month := replace(
  total_trx_amount_received_in_last_6month, 
  (date - i.date) > 180, NA)]
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • It should be less than 180. But it doesn't work, I guess there is some mistake since the computation isn't even finished, it gives memory allocation error: `Error: cannot allocate vector of size 1.2 Gb` – Leyla Alkan Sep 02 '20 at 01:36
  • The code keeps all the values where the difference is less than 180 and turns other values to `NA`. That error means you do not have enough memory for such large processing, check this post https://stackoverflow.com/questions/5171593/r-memory-management-cannot-allocate-vector-of-size-n-mb . I have updated the answer with `data.table` solution check if that helps. – Ronak Shah Sep 02 '20 at 02:03
  • It gives the error: `object 'i.date' not found`. Also shouldn't be `date` in `df1 <- df[data, on = c('from','date')]`? – Leyla Alkan Sep 02 '20 at 08:52
  • I think merging two dataset like this will not give us a proper result, since some information can be missing. – Leyla Alkan Sep 02 '20 at 14:11
  • I posted it as a new question, can you please check it? @Ronak Shah – Leyla Alkan Sep 02 '20 at 16:14
  • This should have been the same question actually. Anyway, is there a way you could share a small reproducible example to test the solution? – Ronak Shah Sep 02 '20 at 23:58
  • Yes, I did that in the new question – Leyla Alkan Sep 03 '20 at 00:56
  • A reproducible example is the one where we can copy-paste the data and use it on our R-session directly. Usually ti is done via `dput`. Here is a good guide which tells you how you can share data http://stackoverflow.com/questions/5963269 – Ronak Shah Sep 03 '20 at 01:03
  • I've added it, can you please check? – Leyla Alkan Sep 03 '20 at 01:41
  • Do you need anything else other than the data? @RonakShah – Leyla Alkan Sep 03 '20 at 03:07