0

This is my transaction data. It shows the transactions made from the accounts in from column to the accounts in to column with the date and the amount information

data 

id          from    to          date        amount  
<int>       <fctr>  <fctr>      <date>      <dbl>
19521       6644    6934        2005-01-01  700.0
19524       6753    8456        2005-01-01  600.0
19523       9242    9333        2005-01-01  1000.0
…           …       …           …           …
1056317     7819    7454        2010-12-31  60.2
1056318     6164    7497        2010-12-31  107.5
1056319     7533    7492        2010-12-31  164.1

Now what I want to do is as follows: I want to go over every transaction row by row and for each account in from column, I want to keep track of how much transaction amount they received over last 6 month at the time that particular transaction was made and want to save this information as a new column.(so this new column will describe total transaction amount received by the accounts in from column in the last six months prior to the transaction date.)

For example:
In the very first line of data, for account 6644, I should look into the to column if 6644 got a transaction between the dates "2004-07-05"-"2005-01-01" which is a 6-month period up until the date 2005-01-01 which is the date when the transaction -we are at- is made by 6644. If there are such transactions that 6644 received, I should sum up them and add this sum information to a new column total_trx_amount_received_in_last_6month as a value. Similarly, I should do the same for account 6753 and look for the transactions it got between the dates "2004-07-05"-"2005-01-01" and sum them up and add this sum to the column total_trx_amount_received_in_last_6month. And, I should continue in this manner row by row in data.

So, how can I achieve this for the whole data?

PS: In the date interval "2004-07-05"-"2005-01-01", "2005-01-01" is the transaction date, to get the second date "2004-07-05" I subtract 180 days(approx. 6 months) from the transaction date "2005-01-01".

To see it better, I provide the following data:
I will also show how the output will be. Let's say we only have this many of transactions. Just consider 5370 account here, since other accounts 8605,6390,8934 don't receive any transactions here.

id          from    to          date        amount  total_trx_amount_received_in_last_6month 
<int>       <fctr>  <fctr>      <date>      <dbl>    <dbl>
18529       5370    9356        2005-05-31  24.4     0.0
13742       5370    5605        2005-08-05  7618.0   0.0
9913        5370    8567        2005-09-12  21971.0  0.0
956         8605    5370        2005-10-05  5245.0   0.0
2557        5370    5636        2005-11-12  2921.0   5245.0    
1602        6390    5370        2005-11-26  8000.0   0.0
18669       5370    8933        2005-11-30  169.2    (5245.0+8000.0)=13245
35900       5370    8483        2006-01-31  71.5     (5245.0+8000.0)=13245
48667       8934    5370        2006-03-31  14.6     0.0
51341       5370    7626        2006-04-11  4214.0   (8000.0+14.6)=8014.6

Here is what I did: First note that this small data above is sorted by date in ascending order.
In the first line, for account 5370 in from column I look at the past data to see if 5370 received any transaction between the dates "2004-12-02"-"2005-05-31". Since the first line is the very first transaction, obviously there's no transaction received by 5370 before the date "2005-05-31", so I log 0.0 into the total_trx_amount_received_in_last_6month column respectively. In the second line, for account 5370 in from column, again there's no transaction received by 5370 between the dates "2005-02-06"-"2005-08-05", so I log 0.0 into the total_trx_amount_received_in_last_6month column. Similarly, I log 0.0 in the 3rd and 4th lines for accounts 5370 and 8605 respectively. In the fifth line, for account 5370 in from column, there's a transaction received by 5370 between the dates "2005-05-16"-"2005-11-12", which is received in "2005-10-05"(in the 4th line of the data) with the amount 5245.0 so I log 5245.0 into the total_trx_amount_received_in_last_6month column. In the sixth line, for account 6390 in from column, there's no transaction received by 6390 between the dates "2005-05-30"-"2005-11-26", so I log 0.0 into the total_trx_amount_received_in_last_6month column. And it goes like this for all the rows of the data.

dput() output of the data:

structure(list(id = c(18529L, 13742L, 9913L, 956L, 2557L, 1602L, 
18669L, 35900L, 48667L, 51341L, 53713L, 60126L, 60545L, 65113L, 
66783L, 83324L, 87614L, 88898L, 89874L, 94765L, 100277L, 101587L, 
103444L, 108414L, 113319L, 121516L, 126607L, 130170L, 131771L, 
135002L, 149431L, 157403L, 157645L, 158831L, 162597L, 162680L, 
163901L, 165044L, 167082L, 168562L, 168940L, 172578L, 173031L, 
173267L, 177507L, 179167L, 182612L, 183499L, 188171L, 189625L, 
193940L, 198764L, 199342L, 200134L, 203328L, 203763L, 204733L, 
205651L, 209672L, 210242L, 210979L, 214532L, 214741L, 215738L, 
216709L, 220828L, 222140L, 222905L, 226133L, 226527L, 227160L, 
228193L, 231782L, 232454L, 233774L, 237836L, 237837L, 238860L, 
240223L, 245032L, 246673L, 247561L, 251611L, 251696L, 252663L, 
254410L, 255126L, 255230L, 258484L, 258485L, 259309L, 259910L, 
260542L, 262091L, 264462L, 264887L, 264888L, 266125L, 268574L, 
272959L), from = c("5370", "5370", "5370", "8605", "5370", "6390", 
"5370", "5370", "8934", "5370", "5635", "6046", "5680", "8026", 
"9037", "5370", "7816", "8046", "5492", "8756", "5370", "9254", 
"5370", "5370", "7078", "6615", "5370", "9817", "8228", "8822", 
"5735", "7058", "5370", "8667", "9315", "6053", "7990", "8247", 
"8165", "5656", "9261", "5929", "8251", "5370", "6725", "5370", 
"6004", "7022", "7442", "5370", "8679", "6491", "7078", "5370", 
"5370", "5370", "5658", "5370", "9296", "8386", "5370", "5370", 
"5370", "9535", "5370", "7541", "5370", "9621", "5370", "7158", 
"8240", "5370", "5370", "8025", "5370", "5370", "5370", "6989", 
"5370", "7059", "5370", "5370", "5370", "9121", "5608", "5370", 
"5370", "7551", "5370", "5370", "5370", "5370", "9163", "9362", 
"6072", "5370", "5370", "5370", "5370", "5370"), to = c("9356", 
"5605", "8567", "5370", "5636", "5370", "8933", "8483", "5370", 
"7626", "5370", "5370", "5370", "5370", "5370", "9676", "5370", 
"5370", "5370", "5370", "9105", "5370", "9772", "6979", "5370", 
"5370", "7564", "5370", "5370", "5370", "5370", "5370", "8744", 
"5370", "5370", "5370", "5370", "5370", "5370", "5370", "5370", 
"5370", "5370", "7318", "5370", "8433", "5370", "5370", "5370", 
"7122", "5370", "5370", "5370", "8566", "6728", "9689", "5370", 
"8342", "5370", "5370", "5614", "5596", "5953", "5370", "7336", 
"5370", "7247", "5370", "7291", "5370", "5370", "6282", "7236", 
"5370", "8866", "8613", "9247", "5370", "6767", "5370", "9273", 
"7320", "9533", "5370", "5370", "8930", "9343", "5370", "9499", 
"7693", "7830", "5392", "5370", "5370", "5370", "7497", "8516", 
"9023", "7310", "8939"), date = structure(c(12934, 13000, 13038, 
13061, 13099, 13113, 13117, 13179, 13238, 13249, 13268, 13296, 
13299, 13309, 13314, 13391, 13400, 13404, 13409, 13428, 13452, 
13452, 13460, 13482, 13493, 13518, 13526, 13537, 13542, 13544, 
13596, 13616, 13617, 13626, 13633, 13633, 13639, 13642, 13646, 
13656, 13660, 13664, 13667, 13669, 13677, 13686, 13694, 13694, 
13707, 13716, 13725, 13738, 13739, 13746, 13756, 13756, 13756, 
13761, 13769, 13770, 13776, 13786, 13786, 13786, 13791, 13799, 
13806, 13813, 13817, 13817, 13817, 13822, 13829, 13830, 13836, 
13847, 13847, 13847, 13852, 13860, 13866, 13871, 13878, 13878, 
13878, 13882, 13883, 13883, 13887, 13887, 13888, 13889, 13890, 
13891, 13895, 13896, 13896, 13899, 13905, 13909), class = "Date"), 
    amount = c(24.4, 7618, 21971, 5245, 2921, 8000, 169.2, 71.5, 
    14.6, 4214, 14.6, 13920, 14.6, 24640, 1600, 261.1, 16400, 
    3500, 2700, 19882, 182, 14.6, 16927, 25653, 3059, 2880, 9658, 
    4500, 12480, 14.6, 1000, 3679, 34430, 12600, 14.6, 19.2, 
    4900, 826, 3679, 2100, 38000, 79, 11400, 21495, 3679, 200, 
    14.6, 100.6, 3679, 5300, 108.9, 3679, 2696, 7500, 171.6, 
    14.6, 99.2, 2452, 3679, 3218, 700, 69.7, 14.6, 91.5, 2452, 
    3679, 2900, 17572, 14.6, 14.6, 90.5, 2452, 49752, 3679, 1900, 
    14.6, 870, 85.2, 2452, 3679, 1600, 540, 14.6, 14.6, 79, 210, 
    2452, 28400, 720, 180, 420, 44289, 489, 3679, 840, 2900, 
    150, 870, 420, 14.6)), row.names = c(NA, -100L), class = "data.frame")

(I turned from and to columns into character, since they had a huge number of levels the output would take so much space otherwise)

Leyla Alkan
  • 355
  • 3
  • 12
  • Why would this answer not work here? https://stackoverflow.com/a/63689794/3962914 Instead of group by `from` you do it for `to` ? – Ronak Shah Sep 03 '20 at 05:34
  • Yes it works but after then I want to add the new column into the data in a way that it describes a feature of ‘from’ column. If we just group by ‘to’ and do the computation like in your solution, that new column will describe a feature of ‘to’ column. I create new columns based on ‘from’ column, so every other features in my data also describe a feature of ‘from’ column. – Leyla Alkan Sep 03 '20 at 14:04
  • although we are doing computation considering `to` column but I think it is what you want unless I am confused. Can you check the output doing that and if it doesn't match your expectations can you update your post showing output of `to` column and your expected output with `from` ? – Ronak Shah Sep 03 '20 at 14:13
  • I manually showed how the output would be just above the dput output of the data. Have you seen that? I computed total_trx_amount_received_in_last_6month column based on ‘to’ column, but added that information based on ‘from’ column. – Leyla Alkan Sep 03 '20 at 14:22
  • I explained how I did the computations, can you please check? I hope it's better now – Leyla Alkan Sep 03 '20 at 14:53
  • dput does not generate reproducible output if there are pointer components. Convert it to a data frame first. Also make it smaller but large enough to illustrate problem. – G. Grothendieck Sep 03 '20 at 14:57
  • I did thanks @G.Grothendieck – Leyla Alkan Sep 03 '20 at 15:00
  • Is it clear now? @RonakShah – Leyla Alkan Sep 03 '20 at 15:47
  • 1
    How about this? `data %>% mutate(amt = map2_dbl(from, date, ~sum(amount[to == .x & between(date, .y - 180, .y)])))` . This is using `dplyr` and `purrr`. – Ronak Shah Sep 03 '20 at 23:47
  • Yes, this is exactly what I want. Could you please share this as an answer so that I can accept it? Thank you! – Leyla Alkan Sep 04 '20 at 01:45
  • Btw, instead of `sum` function, when I run `max` function, if the account has received any transaction in the specified time interval it returns the correct value, but if it hasn't it returns `-Inf` instead of `0` and prints lots of warnings like `Input amt is map2_dbl(...).no non-missing arguments to max; returning -InfProblem with mutate() input amt. x no non-missing arguments to max; returning -Inf` in the console. Whereas `mean` and `median` functions just return `NAN` and `NA` respectively w/o warnings if the account hasn't received any transaction in the specified time interval – Leyla Alkan Sep 04 '20 at 02:32
  • Yes, that is because it returns `numeric(0)` when there are no matches and `max(numeric(0))` returns `-Inf`. – Ronak Shah Sep 05 '20 at 04:18
  • That one answers about how much money sent by each account, here what I ask about is how much money is received by each account @E_net4thepostjanitor – Leyla Alkan Sep 07 '20 at 14:06

1 Answers1

1

We can use map2_dbl and take sum of amount that lie in the 6-month range.

library(dplyr)
library(purrr)

data %>% 
    mutate(amt = map2_dbl(from, date,
                ~sum(amount[to == .x & between(date, .y - 180, .y)])))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • It works on a small data set very well, but when I try it on a large dataset, it seems that it will run forever – Leyla Alkan Sep 06 '20 at 17:53
  • Is there an efficient way to do this? – Leyla Alkan Sep 07 '20 at 14:07
  • You can translate the same code to base R and `data.table` but I am not aware about other way of doing this. – Ronak Shah Sep 07 '20 at 15:00
  • I tried to speed up the code with `future::plan(strategy = multisession)` and modified the code as `data[, amt := furrr::future_pmap_dbl(list(from, date), ~sum(amount[to== .x & between(date, .y-180, .y)])) ]`, but the speed hasn't been improved at all. – Leyla Alkan Sep 08 '20 at 13:12