
I am new to R and this is my first question on stackoverflow.

I am trying

  • to assign by reference to a new column
  • for each row
  • using the value from the first next row within the same group of rows
  • that meets a condition.

Example data:

    id code  date_down    date_up
 1:  1    p 2019-01-01 2019-01-02
 2:  1    f 2019-01-02 2019-01-03
 3:  2    f 2019-01-02 2019-01-02
 4:  2    p 2019-01-03       <NA>
 5:  3    p 2019-01-04       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05
 7:  5    f 2019-01-07 2019-01-08
 8:  5    p 2019-01-07 2019-01-08
 9:  5    p 2019-01-09 2019-01-09
10:  6    f 2019-01-10 2019-01-10
11:  6    p 2019-01-10 2019-01-10
12:  6    p 2019-01-10 2019-01-11

What I would like to do is

  • subset (group) by id
  • and for each row
  • find date_up for the first row further down,
  • where code = 'p' and date_up (of the row found) is greater than date_down for the row I am updating.

My expected result shall be:

    id code  date_down    date_up  founddate
 1:  1    p 2019-01-01 2019-01-02       <NA>
 2:  1    f 2019-01-02 2019-01-03       <NA>
 3:  2    f 2019-01-02 2019-01-02       <NA>
 4:  2    p 2019-01-03       <NA>       <NA>
 5:  3    p 2019-01-04       <NA>       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05       <NA>
 7:  5    f 2019-01-07 2019-01-08 2019-01-08
 8:  5    p 2019-01-07 2019-01-08 2019-01-09
 9:  5    p 2019-01-09 2019-01-09       <NA>
10:  6    f 2019-01-10 2019-01-10 2019-01-11
11:  6    p 2019-01-10 2019-01-10 2019-01-11
12:  6    p 2019-01-10 2019-01-11       <NA>

I have tried many variants, using .SD, .N, and creating a new column with DT[, idcount := seq_len(.N), by = id], but have not really got anywhere. Any help greatly appreciated.
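
For reference, this is roughly the kind of helper column I was experimenting with (a sketch only, using the dt from the dput further down; it just numbers the rows within each id and doesn't solve the lookup):

library(data.table)

# number the rows within each id group, hoping to compare "later" rows
# against the current one
dt[, idcount := seq_len(.N), by = id]
dt[, .(id, code, date_down, date_up, idcount)]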

Also any good references to data.table :) Many thanks

Edit: I have edited the original data supplied to give a more subtle example, whereby row 10 is updated with data from row 12, because row 12 is in the id subset and meets the qualifying criteria. Row 11 does not meet the qualifying criteria and hence the data is not used to update row 10. Also included my first use of dput!

Example data as dput code:

dt <- structure(list(
id        = c(1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 6L),
code      = c("p", "f", "f", "p", "p", "<NA>", "f", "p", "p", "f", "p", "p"),
date_down = structure(c(17897, 17898, 17898, 17899, 17900, 17901, 17903, 17903, 17905, 17906, 17906, 17906), class = "Date"),
date_up   = structure(c(17898, 17899, 17898, NA, NA, 17901, 17904, 17904, 17905, 17906, 17906, 17907), class = "Date")),
class     = c("data.table", "data.frame"),
row.names = c(NA, -12L))
setDT(dt)  # to reinit the internal self ref pointer (known issue)
  • There are some dates in 2017 in your example data. Am I reading the problem incorrectly or should your sample data all be 2019? – krads Mar 09 '19 at 04:04
  • Yes, thanks for spotting the typos. Real data has many more columns. I'll correct the data. – OllieB Mar 09 '19 at 07:35
  • Welcome at SO! Please always add sample data as R code to make it easier for us to answer (eg. using `dput`). THX! – R Yoda Mar 09 '19 at 08:07
  • Is your data guaranteed to be sorted? By which column(s)? – R Yoda Mar 09 '19 at 08:44
  • Thanks R Yoda! I will start doing that, appreciate the tip. The data is sorted by id, date_down and then date_up. I'm looking at 1-2 million rows, about 40 columns in total. – OllieB Mar 09 '19 at 14:43
  • @OllieB Sorted by which column(s) (contained in the example data of your question)? – R Yoda Mar 09 '19 at 16:19
  • @R Yoda Yes, the data is sorted, ascending on id, then ascending on date_down if id is the same, then date_up ascending if the first two are the same. Although I handcrafted the example data, I think it is represented as such. – OllieB Mar 09 '19 at 20:11
  • More questions: Does `founddate` contain the `date_up` data of the found "p" row? Do you update **all** rows or only non-"p" rows? If you update all rows (incl. "p"): Is another update required since the `founddate` of "p"-rows can be interpreted as new `date_up`? – R Yoda Mar 10 '19 at 08:46
  • Thank you for taking such an interest in helping me - answers below. Does founddate contain the date_up data of the found "p" row? Yes it does. Do you update all rows or only non-"p" rows? All rows should be updated if a subsequent row that matches the condition is found, no matter the 'p' status of code. If you update all rows (incl. "p"): Is another update required since the founddate of "p"-rows can be interpreted as new date_up? No, this metadata is used for other purposes. – OllieB Mar 10 '19 at 09:10
  • @OllieB May I edit your question to remove unnecessary sentences (esp. the first two paragraphs)? I want to sponsor a bounty but the question must be short and clear first... – R Yoda Mar 10 '19 at 22:01
  • Yes please, thanks for the help, sorry it's a bit of a mess, I've learnt as much about how to ask questions. – OllieB Mar 10 '19 at 22:18
  • I have restructured the question and opened a bounty. Please watch comments and possible answers to respond timely and to accept only an answer that really solves the described problem (do not modify the requirements in the question ;-) – R Yoda Mar 10 '19 at 23:11
  • OllieB Just a reminder to please consider marking an answer as accepted [by clicking on the green tick mark](https://stackoverflow.com/help/someone-answers) if it has solved your problem. Also, be aware the grace period (the time left to award the bounty) will expire in about 15 hours. In this case as @RYoda raised the bounty I'm not sure if it's him or you who will see the [+50 icon](https://meta.stackexchange.com/a/16067) next to the answer, under the up/down vote arrows. If the answer has enough merit please consider clicking on that to award the bounty before the time expires. – krads Mar 18 '19 at 08:17
  • You answered my next question (down below) - before I had asked it :) thanks - as stated below this is a very complete answer and I am happy to accept it. – OllieB Mar 18 '19 at 09:43
  • @RYoda or OllieB, just a reminder the grace period (the time left to award the bounty) will expire now in about 1 hour. If you intend to award the bounty you have to do that manually. There should be a +50 icon next to the answer (under the up/down vote buttons) which you can click to award the bounty if you feel the answer has enough merit. – krads Mar 18 '19 at 21:37
  • It's not there for me I'm afraid, but the answer is worthy of it if @RYoda can award it. – OllieB Mar 18 '19 at 21:56
  • @krads I have just awarded the bounty to your answer. Thanks for investing so much time, that is the spirit we need at SO :-) – R Yoda Mar 18 '19 at 23:13

3 Answers


Join a data.table to a subset of itself, by group, to get values from rows matching non-equal criteria.

Summary:

  • Below I show 5 working data.table solutions that were candidates for performance testing against the OP's actual data set (1.4M records).

  • All 5 solutions use "non-equi" joins (using inequality to compare columns for the join) in the on clause.

  • Each solution is just a small progressive code change so it should be easy to follow along to compare different data.table options and syntax choices.

Approach

To work through the data.table syntax for this I broke the OP's problem into the following steps:

  1. Join the dt to a subset of itself (or another data.table for that matter).
  2. Select (and rename) the columns you want from either dt or the subset.
  3. Define the join criteria based on columns from dt compared to columns in the subset, including using "non-equi" (non-equal) comparisons.
  4. Optionally define whether first or last match should be selected when multiple matching records are found in the subset.

Solution 1:

# Add row numbers to all records in dt (only because you 
# have criteria based on comparing sequential rows):
dt[, row := .I] 

# Compute result columns (  then standard assignment into dt using <-  )
dt$found_date  <- 
            dt[code=='p'][dt,   # join dt to the data.table matching your criteria, in this case dt[code=='p']
                          .( x.date_up ),   # columns to select, x. prefix means columns from dt[code=='p'] 
                          on = .(id==id, row > row, date_up > date_down),   # join criteria: dt[code=='p'] fields on LHS, main dt fields on RHS
                          mult = "first"]   # get only the first match if multiple matches

Note in the join expressions above (a small illustration follows this list):

  • i in this case is your main dt. This way you get all records from your main data.table.
  • x is the subset (or any other data.table) from which you want to find matching values.
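
To make the prefixes concrete, here is a variant of the same select that pulls columns from both sides of the join (a sketch only; it assumes the row column from Solution 1 has already been added, and simply prints the result instead of assigning it):

# i. prefix = columns from i (the main dt), x. prefix = columns from the
# subset dt[code=='p'] that we are joining to
dt[code=='p'][dt,
              .(i.id, i.date_down, found_date = x.date_up),
              on = .(id==id, row > row, date_up > date_down),
              mult = "first"]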

Result matches requested output:

dt

    id code  date_down    date_up row found_date
 1:  1    p 2019-01-01 2019-01-02   1       <NA>
 2:  1    f 2019-01-02 2019-01-03   2       <NA>
 3:  2    f 2019-01-02 2019-01-02   3       <NA>
 4:  2    p 2019-01-03       <NA>   4       <NA>
 5:  3    p 2019-01-04       <NA>   5       <NA>
 6:  4 <NA> 2019-01-05 2019-01-05   6       <NA>
 7:  5    f 2019-01-07 2019-01-08   7 2019-01-08
 8:  5    p 2019-01-07 2019-01-08   8 2019-01-09
 9:  5    p 2019-01-09 2019-01-09   9       <NA>
10:  6    f 2019-01-10 2019-01-10  10 2019-01-11
11:  6    p 2019-01-10 2019-01-10  11 2019-01-11
12:  6    p 2019-01-10 2019-01-11  12       <NA>

Note: You may remove the row column by doing dt[, row := NULL] if you like.

Solution 2:

Identical logic as above to join and find the result columns, but now using "assign by reference" := to create found_date in dt:

dt[, row := .I] # add row numbers (as in all the solutions)

# Compute result columns (  then assign by reference into dt using :=  )

# dt$found_date  <- 
dt[, found_date :=   # assign by reference to dt$found_date 
            dt[code=='p'][dt, 
                          .( x.date_up ), 
                          on = .(id==id, row > row, date_up > date_down),
                          mult = "first"]]

In Solution 2, the slight variation to assign our results "by reference" into dt should be more efficient than Solution 1. Solution 1 calculated results the exact same way - the only difference is Solution 1 used standard assignment <- to create dt$found_date (less efficient).
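
If you want to verify that on your own data, a rough timing sketch using base system.time around the two assignment forms (the expressions are exactly the Solution 1 and Solution 2 code above):

# standard assignment (Solution 1)
system.time(
  dt$found_date <- dt[code=='p'][dt, .( x.date_up ),
                                 on = .(id==id, row > row, date_up > date_down),
                                 mult = "first"]
)

# assignment by reference (Solution 2)
system.time(
  dt[, found_date := dt[code=='p'][dt, .( x.date_up ),
                                   on = .(id==id, row > row, date_up > date_down),
                                   mult = "first"]]
)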

Solution 3:

Like Solution 2 but now using .(.SD) in place of dt to refer to the original dt without naming it directly.

dt[, row := .I] # add row numbers (as in all the solutions)
setkey(dt, id, row, date_down)  #set key for dt 

# For all rows of dt, create found_date by reference :=
dt[, found_date := 
            # dt[code=='p'][dt, 
            dt[code=='p'][.(.SD),   # our subset (or another data.table), joined to .SD (referring to original dt)
                          .( x.date_up ), 
                          on = .(id==id, row > row, date_up > date_down),  
                          mult = "first"] ]  

.SD above references back to the original dt that we are assigning back into. It corresponds to the subset of the data.table containing the rows selected by the outer dt[ call, which here is all the rows, because we didn't filter or group it.
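
If .SD is unfamiliar, here is a tiny standalone illustration (separate from the solution) of what it refers to, using the 12-row example data:

# without by, .SD is the whole table
dt[, .(rows_in_SD = nrow(.SD))]
#    rows_in_SD
# 1:         12

# with by, .SD is the subset of rows belonging to each group
dt[, .(rows_in_SD = nrow(.SD)), by = id]
#    id rows_in_SD
# 1:  1          2
# 2:  2          2
# 3:  3          1
# 4:  4          1
# 5:  5          3
# 6:  6          3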

Note: In Solution 3 I used setkey() to set the key. I should have done that in Solution 1 & Solution 2 - however I didn't want to change those solutions after @OllieB tested them successfully.

Solution 4:

Like Solution 3 but using .SD once more than previously. Our main data.table name dt now appears only once across our entire expression!

# add row column and setkey() as previous solutions

dt[, found_date :=
            # dt[code=='p'][.(.SD), 
            .SD[code=='p'][.SD,   # .SD in place of dt at left!  Also, removed .() at right (not sure on this second change)
                           .(found_date = x.date_up),
                           on = .(id==id, row > row, date_up > date_down),
                           mult = "first"]]

With the change above our data.table name dt appears only once. I like that a lot because it makes it easy to copy, adapt and reuse elsewhere.

Also note: Where I'd previously used .(.SD) I've now removed the .() around .SD because it doesn't appear to be required. However for that change I'm not sure if it has any performance benefit or whether it's the preferred data.table syntax. I would be grateful if anyone can add a comment to advise on that point.

Solution 5:

Like the previous solutions, but making use of by to explicitly group the subsets that the join operates on.

# add row column and setkey() as previous solutions

dt[, found_date :=
       .SD[code=='p'][.SD,
                      .(found_date = x.date_up),
                      # on = .(id==id, row > row, date_up > date_down),
                      on = .(row > row, date_up > date_down),  # removed the id column from here
                      mult = "first"]
   , by = id]   # added by = id to group the .SD subsets 

On this last solution I changed it to use the by clause to explicitly group the .SD subsets on id.

Note: Solution 5 did not perform well against OllieB's actual data compared to Solutions 1 - 4. However, testing my own mock data I found that Solution 5 could perform well when the number of unique groups in the id column was low (a sketch of such mock data follows the list):
  • With only 6 groups in 1.5M records this solution worked just as fast as the others.
  • With 40k groups in 1.5M records I saw similarly poor performance to what OllieB reported.
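
A rough sketch of how mock data along those lines could be generated to reproduce the comparison (the column names follow the question; the group count, date range and sizes are arbitrary assumptions):

library(data.table)
set.seed(1)
n_rows   <- 1.5e6
n_groups <- 40000   # try 6 vs 40000 to see the effect on Solution 5

mock <- data.table(
  id        = sample(n_groups, n_rows, replace = TRUE),
  code      = sample(c("p", "f"), n_rows, replace = TRUE),
  date_down = as.Date("2019-01-01") + sample(0:365, n_rows, replace = TRUE)
)
mock[, date_up := date_down + sample(0:10, n_rows, replace = TRUE)]
setkey(mock, id, date_down, date_up)   # sorted as described in the question
mock[, row := .I]

# then time any of the solutions against it, e.g. Solution 2:
system.time(
  mock[, found_date := mock[code=='p'][mock, .( x.date_up ),
                                       on = .(id==id, row > row, date_up > date_down),
                                       mult = "first"]]
)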

Results

Solutions 1 - 4 performed well:

  • For 1.45M records in OllieB's actual data, each of Solutions 1 to 4 took 2.42 seconds or less "elapsed" time according to OllieB's feedback. Solution 3 appears to have been the fastest for OllieB, with "elapsed=1.22" seconds.

  • I personally prefer Solution 4 because of the simpler syntax.

Solution 5

  • Solution 5 (using the by clause) performed poorly, taking 577 seconds in OllieB's testing on his real data.

Versions used

data.table version: 1.12.0

R version 3.5.3 (2019-03-11)


Possible further improvements:

  • Changing the date fields to integer may help the join run more efficiently. See as.IDate() for converting dates to integer-backed dates in data.tables (a small sketch follows this list).
  • The setkey() step may no longer be needed: as explained here by @Arun, on= invokes [often] more efficient secondary indices and auto indexing.
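
A minimal sketch of the IDate conversion, applied in place to the example dt (untested against the large data set):

# convert the Date columns to integer-backed IDate columns by reference
dt[, `:=`(date_down = as.IDate(date_down),
          date_up   = as.IDate(date_up))]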

References to data.table

As part of your question you've asked for "any good references to data.table". I've found the following helpful:

Importantly, note this answer by @Arun, which explains "the reason for implementing on= argument" and suggests it may no longer be necessary to set keys:

It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.

In most cases therefore, there shouldn't be a need to set keys any more. We recommend using on= wherever possible, unless setting key has a dramatic improvement in performance that you'd like to exploit.


As always I'm grateful if anyone has suggestions as perhaps this can be improved further.

Please feel free to comment, correct or post other solutions if you can add anything.

  • Thanks krads. But I think there are a couple of problems with the solution. One is easily fixable: there is no clause to check that date_up of the found row is > date_down of the original row. The big problem could be something to do with my original wording not being clear enough. What I am trying to do is search the whole subset, i.e. find the first occurrence after the row we are updating where the condition is met. I think this solution only finds the data if the next row meets the condition, but not if it doesn't, even if there is a later row in the subset that does. – OllieB Mar 10 '19 at 17:09
  • I think I may be able to change the on condition to match what you want but I will have to test it later. In the meantime if you can edit your original question to include the test cases that you describe it will be helpful as input to that and for others that might be able to step in with other solutions. – krads Mar 10 '19 at 17:38
  • Please see revised answer. – krads Mar 11 '19 at 01:17
  • I think that's it, won't be able to check until tonight now - but it looks like it on the first read. As always a million thanks for spending your time on this. Back later. – OllieB Mar 11 '19 at 08:48
  • Yes! It works, thanks @krads, so appreciated - really helped me on my learning journey, and it's so simple when you know how to do it :) (and you get a well-articulated question). And thanks everyone else who came to my rescue with tips along the way. – OllieB Mar 11 '19 at 19:34
  • Pretty answer :-) Please note that this answer does **not** update the `data.table` by reference but creates a new one you have to store in a new variable. IMHO this is OK, and you could even return only the derived column `found_date` and add it to the original `dt` by reference (to optimise the performance). – R Yoda Mar 11 '19 at 19:52
  • @RYoda: As you suggest I've updated the solution to add only the one derived column (removing the unnecessary example of how to return multiple columns from the subset). Makes it much cleaner/clearer for this problem so thank you! Still looking for a way to update by reference... – krads Mar 12 '19 at 09:49
  • Added 'Solution 2' using update "by reference" to add the result set into the original `dt` – krads Mar 12 '19 at 11:58
  • @OllieB krads has added a second solution, it would be great if you give us a short indication how long solution 1 and 2 (and possible the other answers) are running to calculate the result (because only you have more than 12 rows of data ;-) – R Yoda Mar 12 '19 at 18:15
  • @krads - thanks - this is definitely the solution I was looking for (either will do me). It has set me off doing some other funky stuff that I also need, such as counts and concatenation of columns found :) I have run system.time on both of these solutions, against a 1.45m row, 30 column data.table and the results are right where I need them. The first is: `user=1.70 system=0.86 elapsed=2.42` & the 2nd is: `user=1.49 system=0.64 elapsed=1.94` And for context, I have (or rather am) learnt/learning R to enable me to write code to out-perform .... – OllieB Mar 12 '19 at 20:25
  • ... a "high-performance data processing, analytics and reporting" tool which took over an hour to run the number of subsets I need to run. I anticipate it taking less than a minute in R. Thanks everyone .... so appreciated! – OllieB Mar 12 '19 at 20:25
  • One more improvement I intend to add to my answer is to set the keys on the data.table, which should improve performance over your 1.45m record set. I'll do that later when I have time. – krads Mar 13 '19 at 01:02
  • @OllieB Thanks for indicating the performance. And just to be sure: the solutions run within 1.x seconds (not hours) for 1.45m rows, do they? – R Yoda Mar 13 '19 at 08:56
  • @OllieB I've added a Solution 3 which is like solution 2 but making use of .SD in the appropriate place. Also in Solution 3 I set the key for dt which for 1.45m records may help speed things further. – krads Mar 13 '19 at 11:52
  • I like 3rd solution, should be the fastest one also – jangorecki Mar 13 '19 at 12:44
  • Looks like you have hit the jackpot with that one: `user=0.72 system=0.48 elapsed=1.22` Thanks! Amending my code! – OllieB Mar 13 '19 at 14:48
  • @OllieB I've added Solution 4 & 5 for comparison hopefully improving use of .SD and in the last solution using by to explicitly group data & operations rather than relying on the on join to link the groups. I know you have a solution that you were happy worked fast enough for your real problem. However, if you have time I think it would be educational to do a benchmark comparison of the last 3 solutions. – krads Mar 16 '19 at 08:39
  • Solution 4 and 5 have been run against the same data set. Solution 4: `user=1.42 system=0.68 elapsed=2.21` Solution 5: `user=1.32 system=0.49 elapsed=2.01` – OllieB Mar 16 '19 at 13:21
  • I noticed that solution 5 is actually much more inefficient - I had copied and pasted the wrong code! Real `system.time` for solution 5: `user=499.09 system=317.94 elapsed=577.31 ` – OllieB Mar 16 '19 at 15:25
  • @OllieB I've updated my answer to indicate the range of timings you reported. Out of curiosity how many unique `id` values do you have in your real data? My own test results for Solution 5 showed good results for 1.5M records when there were only 6 unique id values but when I changed my mock set to 40k unique id values it tanked the time similar to the result you reported. – krads Mar 17 '19 at 18:24
  • @OllieB I've added references to data.table as requested as the last part of your question. – krads Mar 18 '19 at 01:21
  • Hi Krads - this is an amazing answer, so thank you. I am new to SO, so please let me know if there is anything else I can do to upvote this - as I am very appreciative. I have 50,184 unique `id` in my `data.table` – OllieB Mar 18 '19 at 09:36
  • Looking through, I missed a question from R Yoda. Yes, the solution is running in seconds. It truly is phenomenal. – OllieB Mar 18 '19 at 09:37

A Not-the-Data-Table-Way approach:

df <- structure(list(
  id        = c(1L, 1L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 6L),
  code      = c("p", "f", "f", "p", "p", "<NA>", "f", "p", "p", "f", "p", "p"),
  date_down = structure(c(17897, 17898, 17898, 17899, 17900, 17901, 17903, 17903, 17905, 17906, 17906, 17906), class = "Date"),
  date_up   = structure(c(17898, 17899, 17898, NA, NA, 17901, 17904, 17904, 17905, 17906, 17906, 17907), class = "Date")),
  class     = c("data.frame"),
  row.names = c(NA, -12L))

Lista <- lapply(split(df, df$id), function(x){
  x$founddate <-
    sapply(c(1:nrow(x)), function(y){
      na.omit(sapply(y:nrow(x), function(i){
        ifelse(x[i + 1, "code"] == "p" & x[i + 1, "date_up"] > x[y, "date_down"],
               x[i + 1, "date_up"], NA)
      }))[1]
    })
  x$founddate <- as.Date(x$founddate, origin = "1970-01-01")
  return(x)
})

df <- do.call(rbind.data.frame, Lista)

df
     id code  date_down    date_up  founddate
1.1   1    p 2019-01-01 2019-01-02       <NA>
1.2   1    f 2019-01-02 2019-01-03       <NA>
2.3   2    f 2019-01-02 2019-01-02       <NA>
2.4   2    p 2019-01-03       <NA>       <NA>
3     3    p 2019-01-04       <NA>       <NA>
4     4 <NA> 2019-01-05 2019-01-05       <NA>
5.7   5    f 2019-01-07 2019-01-08 2019-01-08
5.8   5    p 2019-01-07 2019-01-08 2019-01-09
5.9   5    p 2019-01-09 2019-01-09       <NA>
6.10  6    f 2019-01-10 2019-01-10 2019-01-11
6.11  6    p 2019-01-10 2019-01-10 2019-01-11
6.12  6    p 2019-01-10 2019-01-11       <NA>

Under the given conditions there can be more than one match per row. The proposed answer takes the first match, but this can be modified, as sketched below.
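
For example, a vectorised sketch of the per-row lookup that collects every qualifying date_up instead of only the first (same logic as the answer, column names from the question, untested on large data):

Lista <- lapply(split(df, df$id), function(x) {
  x$all_found <- lapply(seq_len(nrow(x)), function(y) {
    # rows strictly below row y, with code "p" and a date_up after this row's date_down
    idx <- which(x$code == "p" &
                 x$date_up > x$date_down[y] &
                 seq_len(nrow(x)) > y)
    x$date_up[idx]          # all qualifying dates; taking [1] reproduces "first match"
  })
  x
})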

Hope it helps.

  • 3
    I've left this one running for an hour now, against 1.5m rows. But not returned - I think against big data sets the data.table way is the way forward. Thanks for the time and help - all these answers are really helping me. – OllieB Mar 12 '19 at 21:12
  • 1
    @OllieB Your welcome! Sure, there always are many aproachs to the same problem. Also, the base solution could be speed up with some parallelization vía the parallel / doParallel package. – Santiago Capobianco Mar 14 '19 at 03:00

Here is a quick and dirty way which doesn't require much thinking on your part; it captures the first viable option in the subset and leaves an NA if none exists.

The do(f(.)) call evaluates the predefined function f on each subset of dt defined by the group_by statement. I would translate this simple script into Rcpp for serious use.

library(dplyr)
f <- function(x){
  x <- x %>% mutate(founddate = as.Date(NA))

  # for each row, look only at the rows below it within the same group
  for(i in 1:nrow(x)){
    y <- x[i, "date_down"]
    # first later row with code == "p" and date_up greater than this row's date_down
    x[i, "founddate"] <- (x[-c(1:i), ] %>% filter(code == "p", date_up > y) %>% select(date_up))[1, ]
  }

  return(x)
}

dt %>% group_by(id) %>% do(f(.))

# A tibble: 12 x 5
# Groups:   id [6]
      id code  date_down  date_up    founddate 
   <int> <chr> <date>     <date>     <date>    
 1     1 p     2019-01-01 2019-01-02 NA        
 2     1 f     2019-01-02 2019-01-03 NA        
 3     2 f     2019-01-02 2019-01-02 NA        
 4     2 p     2019-01-03 NA         NA        
 5     3 p     2019-01-04 NA         NA        
 6     4 <NA>  2019-01-05 2019-01-05 NA        
 7     5 f     2019-01-07 2019-01-08 2019-01-08
 8     5 p     2019-01-07 2019-01-08 2019-01-09
 9     5 p     2019-01-09 2019-01-09 NA        
10     6 f     2019-01-10 2019-01-10 2019-01-11
11     6 p     2019-01-10 2019-01-10 2019-01-11
12     6 p     2019-01-10 2019-01-11 NA 

Your comment about terrible performance is unsurprising. I would send this as a private message if I knew how, but below is an Rcpp::cppFunction to do the same thing.

Rcpp::cppFunction('DataFrame fC(DataFrame x) {
                    int i, j;
                    int n = x.nrows();
                    CharacterVector code = x["code"];
                    DateVector date_up = x["date_up"];
                    DateVector date_down = x["date_down"];
                    DateVector founddate = rep(NA_REAL, n);

                    for(i = 0; i < n; i++){
                      for(j = i + 1; j < n; j++){
                        if(code(j) == "p"){
                          if(date_up(j) > date_down(i)){
                            founddate(i) = date_up(j);
                            break;
                          } else{
                            continue;
                          }
                        } else{
                          continue;
                        }
                      }
                    }
                    x.push_back(founddate, "founddate");
                    return x;
                    }')

dt %>% group_by(id) %>% do(fC(.))
  • Hi Croote - you're right, this is very readable - but it looks like performance-wise it is no match for the data.table way. Ran system.time on it and it is currently telling me it is 2% of the way through with 1hr remaining .... that's approx 1.5m rows. Thanks for your time helping me - I am learning with every answer – OllieB Mar 12 '19 at 20:42
  • Hi OllieB, yes it is very slow. When I have run into these kinds of problems I tend to write the function in C++ through `Rcpp` to keep up with performance like what you see in `data.table`, as I am not too familiar with `data.table`. – Croote Mar 12 '19 at 21:35
  • @OllieB I would greatly appreciate it if you could tell me how that `cppFunction` performs on your 1.5m rows. I have found `data.table` to be absurdly fast to compete with. – Croote Mar 12 '19 at 22:25
  • sorry I tried to run your code via RStudio and my laptop lockdown wouldn't let me complete the code, it started and downloaded some compiler packages but then kicked me out because of a permissions issue. It's a bit beyond where I am at the moment :) – OllieB Mar 12 '19 at 22:46
  • No worries! I will try to simulate some data :) – Croote Mar 12 '19 at 22:54