
I have just started using R and wrote the following code, but it takes about 40 minutes to run, so I am sure it can be written in a way that runs faster.

Basically, I have one large data set (crsp_td_net) of about 7 GB in size and a second, smaller data set (ff_35f). Both contain trading dates. What I am trying to do is fill in the trading dates for each company in the first data set.

From my first data set, I am creating subsets of data based on a company index, and then merging each subset with the second data set based on trading dates. Each merged subset is appended to the results for the previous companies, and so on, until at the end I am left with a large data set containing all the initial companies but with the missing trading days filled in.

I am not sure at this point whether the fact that the data frame final keeps expanding is causing my loop to run slowly, or whether the loop itself is coded inefficiently. I understand that vectorizing the operation could speed this up, but I am not sure how to do that here (the size of the subset changes for each company). I am also not sure of the best way to use apply, sapply or lapply (if any of these can be used here). I have browsed a few related questions on R but have not found a way to go about this. I would very much appreciate an alternative snippet of code that can make the below run faster.

library(dplyr)

# empty object to accumulate the per-company results
todo <- matrix(numeric(0), 0, 4)

for (i in 1:7396) {
  # subset the company's rows and merge in all trading dates
  # (date_crsp_orig is assumed to already exist in crsp_td_net)
  final <- crsp_td_net %>%
    filter(compid == i) %>%
    merge(ff_35f, by = "date_crsp", all = TRUE)

  # keep only dates between the company's first and last trading date,
  # fill in the missing identifiers, and drop duplicates
  final <- final %>%
    filter(between(date_crsp,
                   as.Date(min(date_crsp_orig, na.rm = TRUE)),
                   as.Date(max(date_crsp_orig, na.rm = TRUE)))) %>%
    arrange(date_crsp) %>%
    mutate(cusip8dg_compustat = ifelse(is.na(cusip8dg_compustat),
                                       max(cusip8dg_compustat, na.rm = TRUE),
                                       cusip8dg_compustat)) %>%
    mutate(compid = ifelse(is.na(compid), i, compid)) %>%
    select(compid, cusip8dg_compustat, date_crsp, date_crsp_orig) %>%
    distinct()

  # append this company's completed rows to the running result
  todo <- bind_rows(todo, final)
}

Thanks in advance,

Dev

Thank you all for your responses. I was unable to reply in the comment box due to the length limit, so I am adding to my original post. @P Lapointe, please find a reproducible data set below (I have used integer values instead of actual dates). @eipi10 - I think you have understood what I am after, and thanks for the code, but I am not sure if it is missing something, as it is prompting for an input (I have all the relevant libraries loaded). @Alistaire - I will indeed be facing memory problems as I perform more calculations to add to the original data set. I would be grateful for suggestions on how to make the loop faster, or for an alternative to it; it would be very helpful to see how they would be implemented in the example below.

Many thanks

zz <- "compid  date_crsp 
1          1        2
2          1        3     
3          1        5   
4          2        3  
5          2        7 
6          2        9 
7            3        3
8            3        5
9          3        7
10         3        8"
crsp_td_net <- read.table(text=zz, header = TRUE)


xx <- "date_crsp 
1 1 
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10  10
11  11"

ff_35f <- read.table(text=xx, header = TRUE)

# I expect my final output to look like this:

yy<-"compid date_crsp
1   1   2
2   1   3
3   1   4
4   1   5
5   2   3
6   2   4
7   2   5
8   2   6
9   2   7
10  2   8
11  2   9
12  3       3
13  3       4
14  3       5
15  3       6
16  3       7
17  3       8"

output_wanted<-read.table(text=yy, header = TRUE)

df <- full_join(crsp_td_net, expand.grid(compid = unique(crsp_td_net$compid), date_crsp=unique(ff_35f$date_crsp))) 

# empty object to accumulate the per-company results
todo <- matrix(numeric(0), 0, 0)



for (i in 1:3) {
  final <- filter(crsp_td_net, compid == i)
  final <- mutate(final, date_crsp_orig = date_crsp)
  final <- merge(final, ff_35f, by = "date_crsp", all = TRUE)
  final <- filter(final, between(date_crsp,
                                 min(date_crsp_orig, na.rm = TRUE),
                                 max(date_crsp_orig, na.rm = TRUE)))
  final <- arrange(final, date_crsp)
  final <- mutate(final, compid = ifelse(is.na(compid), i, compid))
  final <- select(final, compid, date_crsp)
  final <- distinct(final)
  todo <- bind_rows(todo, final)
}

I have modified the full_join example and it now runs, but it is not doing what I want, i.e. merging each compid with the unique trading days to fill in the missing trading days in the first data set. I would very much appreciate any suggestions on this.

The loop I wrote above gives me exactly what I want, but I will have to loop over 7,000 or so compid values to build the large data set todo, which takes about 40 minutes to run. I wonder if there is a faster way to write this loop or an alternative to it.

Many thanks in advance

crsp_td_net$date_crsp_orig <- crsp_td_net$date_crsp

df <- full_join(crsp_td_net,
                expand.grid(compid = unique(crsp_td_net$compid),
                            date_crsp = unique(ff_35f$date_crsp)),
                by = "date_crsp")
df <- df %>% filter(between(date_crsp,
                            min(date_crsp_orig, na.rm = TRUE),
                            max(date_crsp_orig, na.rm = TRUE)))
df <- df %>% filter(!is.na(compid.x)) %>%
  select(-compid.y) %>%
  distinct() %>%
  arrange(compid.x, date_crsp)
  • Please provide a small subset of the data and the expected result. This question is not reproducible. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Pierre Lapointe Apr 15 '17 at 22:06
  • If you want the final data frame to have one row for every combination of company and trading date, then you can do: `df = full_join(crsp_td_net, expand.grid(company = unique(crsp_td_net$company), date_crsp = unique(ff_35f$date_crsp)))`, where I've assumed that every company appears at least once in `crsp_td_net` and every trading date (`date_crsp`) appears at least once in `ff_35f`. You'll need to load the `dplyr` package to use the `full_join` function. – eipi10 Apr 15 '17 at 22:41
  • If your data is already 7Gb, a full join could quickly get too large to hold in memory, depending on the structure. Check the effects on a subset to see whether the results even make sense—incorrectly applied, full joins can insert a lot of useless repetition. – alistaire Apr 15 '17 at 22:59
  • thanks for your quick replies. I have added to my initial posting above to illustrate the problem with a reproducible data set. – Dev V Apr 16 '17 at 00:13

3 Answers


Although the OP has asked for a dplyr solution, I can only suggest one that uses the foverlaps() function from the data.table package.

The OP wants to complete the trading dates for each company in crsp_td_net with the trading dates given in ff_35f. Completion means filling up a date range, from a start date to an end date, with the given dates. (Note that the OP is using integer values in place of dates.) The given dates can themselves be treated as date ranges, each consisting of a single day.

With that, the problem can be rephrased as finding the overlaps of two sequences of (date) ranges, i.e. an overlap join. For this, the foverlaps() function can be used; it is inspired by the findOverlaps() function of Bioconductor's IRanges package but works with non-genomic (i.e., non-integer) ranges as well.

library(data.table)
# coerce to data.table
setDT(crsp_td_net)
setDT(ff_35f)

# find start and end date for each company
comp_date_range <- crsp_td_net[, .(start = min(date_crsp), end = max(date_crsp)), 
                                         by = compid]

# turn given dates into date ranges of one day lengths 
# by adding an end column equal to the start dates
ff_35f[, end := date_crsp]

# set keys
setkey(comp_date_range, start, end)
setkey(ff_35f, date_crsp, end)

# find all overlapping ranges
temp <- foverlaps(comp_date_range, ff_35f)

# reorder result for convenience and pick desired columns
result <- temp[order(compid, date_crsp), .(compid, date_crsp)]

The result is in line with the expected output:

result
#    compid date_crsp
# 1:      1         2
# 2:      1         3
# 3:      1         4
# 4:      1         5
# 5:      2         3
# 6:      2         4
# 7:      2         5
# 8:      2         6
# 9:      2         7
#10:      2         8
#11:      2         9
#12:      3         3
#13:      3         4
#14:      3         5
#15:      3         6
#16:      3         7
#17:      3         8

This can be written more concisely in a single line of code:

foverlaps(
  setkey(setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)), 
                            by = compid], start, end),
  setkey(setDT(ff_35f)[, .(date_crsp, end = date_crsp)], date_crsp, end)
)[order(compid, start), .(compid, date_crsp)]

Note: The OP has replaced dates with integers in the question. The data.table package offers date and time classes with integer storage for fast sorting and grouping (see ?as.IDate).
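
For real dates, the conversion might look like this (just a sketch, assuming the actual date_crsp columns hold Date or character values):

setDT(crsp_td_net)[, date_crsp := as.IDate(date_crsp)]  # integer-backed dates
setDT(ff_35f)[, date_crsp := as.IDate(date_crsp)]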

  • @Uwe Block - thanks for this. This does exactly what I need and does it very fast - took only one second on my large data set. – Dev V Apr 17 '17 at 01:50
  • @DevV If you are dealing with real dates you may find `data.tables`' `IDate` class useful which uses integer instead of double storage mode. Could save memory (as well as time) with your 7 Gb data set. – Uwe Apr 17 '17 at 07:27
  • 1
    @Arun Yeah, minor point but it might teach people to use `foverlaps()` where appropriate. – Uwe Apr 17 '17 at 10:05
  • @Uwe Block Thanks very much for a step-by-step explanation of the solution you have proposed. This is really helpful for a novice R user such as myself and I am now definitely going to pursue the use of foverlaps() - this is very fast. Thanks to all for your help - this is very much appreciated. – Dev V Apr 17 '17 at 10:20

Giving this problem a second thought, I believe it can be solved at reasonable speed using data.table's non-equi joins. (I'm posting this as a separate answer because the approach is quite different from foverlaps().)

library(data.table)

# coerce to data.table
setDT(crsp_td_net)
setDT(ff_35f)

# find start and end date for each company
comp_date_range <- crsp_td_net[, .(start = min(date_crsp), end = max(date_crsp)), 
                               by = compid]

# non equi join: the result contains only rows which fulfill the condition in on = ...
# by = .EACHI executes .SD for each group, returning matching rows for each date
# nomatch = 0 (inner join) skips dates without matching company
temp <- comp_date_range[ff_35f, on = c("start<=date_crsp", "end>=date_crsp"), 
                        .SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE]

# reorder result for convenience and pick desired columns
result <- temp[order(compid, start), .(compid, date_crsp = start)]

The result is in line with expected output

result
#    compid date_crsp
# 1:      1         2
# 2:      1         3
# 3:      1         4
# 4:      1         5
# 5:      2         3
# 6:      2         4
# 7:      2         5
# 8:      2         6
# 9:      2         7
#10:      2         8
#11:      2         9
#12:      3         3
#13:      3         4
#14:      3         5
#15:      3         6
#16:      3         7
#17:      3         8

Note that there is an implicit assumption that the dates given in ff_35f cover the whole range of dates used in crsp_td_net. Otherwise, company trading days would be dropped from the result.
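
A quick sanity check before relying on the join (just a sketch using the objects defined above):

# check that every trading date in crsp_td_net also appears in ff_35f;
# dates missing from ff_35f would silently drop out of the inner join
missing_dates <- setdiff(crsp_td_net$date_crsp, ff_35f$date_crsp)
if (length(missing_dates) > 0) {
  warning(length(missing_dates), " date(s) in crsp_td_net are not covered by ff_35f")
}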

Benchmark results

At the time of writing, three different solutions had been posted. The OP measured the elapsed times of all three on the 7 GB data set and reported them in the comments: about 1.1 s for foverlaps(), 1.4 s for the non-equi join, and 12 s for the dplyr/tidyr approach.

I was quite surprised to find the foverlaps() solution to be faster than the non-equi join, so I ran some benchmarks with varying problem sizes using the microbenchmark package.

The problem size is given by the number of companies. For each company, the trading days are randomly sampled from a selection of 260 "dates" simulating one year without weekends (For details see code below). The data set contains about 130 rows per company on average.

As can be seen from the chart of my own benchmarks (note that both axes are in log scale),

(Chart: median execution time versus number of rows for the three approaches, log-log scale.)

foverlaps() is somewhat faster than the non-equi join for larger problem sizes, while the non-equi join is the fastest method for smaller problem sizes. tidyr/dplyr is almost always the slowest method and an order of magnitude slower on large problems.

Define function for benchmark runs of problem size n_comp

bm_run <- function(n_comp) {
  # define 1 year of trading dates, simulating weekends
  ff_35f <- sort(outer(1:5, 7*(0:51), `+`))
  # create trading dates for each company
  crsp_td_net <- rbindlist(lapply(seq_len(n_comp), function(i) {
    # how many trading dates to sample for the current company?
    n_days <- sample(length(ff_35f), 1)
    # sample trading dates
    data.frame(compid = i,
               date_crsp = sort(sample(ff_35f, n_days)))
  }))
  # coerce to data.frame
  setDF(crsp_td_net)
  # turn vector of trading dates into data.frame
  ff_35f <- data.frame(date_crsp = ff_35f)
  # scale down number of repetitions with problem size
  n_times <- as.integer(scales::squish(1000*1000 / nrow(crsp_td_net), c(3, 1000)))
  print(sprintf("%i companies with a total of %i trading dates, %i runs", 
                n_comp, nrow(crsp_td_net), n_times))
  # do the benchmark runs for this problem size
  mb <- microbenchmark::microbenchmark(
    foverlaps = {
      foverlaps(
        setkey(setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)), 
                                  by = compid], start, end),
        setkey(setDT(ff_35f)[, .(date_crsp, end = date_crsp)], date_crsp, end)
      )[order(compid, start), .(compid, date_crsp)]
    },
    non_equi_join = {
      setDT(crsp_td_net)[, .(start = min(date_crsp), end = max(date_crsp)), by = compid
                         ][setDT(ff_35f), on = c("start<=date_crsp", "end>=date_crsp"), 
                           .SD, by = .EACHI, nomatch = 0, allow.cartesian = TRUE
                           ][order(compid, start), .(compid, date_crsp = start)]
    },
    dplyr = {
      setDF(crsp_td_net)
      setDF(ff_35f)
      crsp_td_net %>%
        dplyr::group_by(compid) %>%
        dplyr::summarize(date_crsp = list(seq(from=min(date_crsp), to=max(date_crsp), by=1))) %>%
        tidyr::unnest() %>%
        dplyr::semi_join(ff_35f, by="date_crsp") %>%
        dplyr::arrange(compid, date_crsp)
    },
    times = n_times
  )
  # return problem size and timings as list
  return(list(n_comp, nrow(crsp_td_net), mb))
}

Run benchmark for different problem sizes

library(data.table)
library(magrittr)
# number of companies
n_comp <- outer(c(1,2), 10^(1:4), `*`)
# set seed of RNG for creation of reproducible data
set.seed(1234)
# do benchmark runs with different problem size derived from number of companies
bm <- lapply(n_comp, bm_run)

Prepare data for plotting

# create data.table with benchmark timinings from chunks in returned list
mbl <- rbindlist(lapply(bm, `[[`, i = 3), id = "n_row")
# aggregate results
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
# reorder factor levels 
mba[, expr := forcats::fct_reorder(expr, -median_time)]
# replace chunk number by number of rows
mba[, n_row := unlist(lapply(bm, `[[`, i = 2))[n_row]]

Create chart

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = unique(mba$n_row), labels = scales::comma) + 
  scale_y_log10() +
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()
  • @DevV I've noticed that you have benchmarked the two other answers. I'm curious to learn how the non-equi join compares. I expect this to be even faster. – Uwe Apr 17 '17 at 11:15
  • After an error message, I added allow.cartesian=TRUE to the above and this executed in 1.41 (elapsed) seconds. By comparison, the foverlaps() solution took 1.12 (elapsed) seconds. – Dev V Apr 17 '17 at 15:33
  • @DevV I was doing some benchmarks as well and came across the cartesian issue as well. To my surprise non-equi joins aren't faster than foverlaps(). – Uwe Apr 17 '17 at 15:40
  • @Arun I did some benchmarks to compare `foverlaps()` with _non-equi joins_ and was quite surprised that `foverlaps()` beats _non-equi joins_ for large problem sizes. Any idea why? – Uwe Apr 17 '17 at 18:15
  • Uwe, `foverlaps` is designed for a specific problem.. non-equi joins are more general + feature rich (combining `j` and `by`). It's therefore not really surprising. But in general, the difference shouldn't be that noticeable. There are optimisations that should be possible for certain common cases, but I'm not sure when I'll work on it. PS: I think, to make it clear, you should mention that axes are in log scale (by mentioning it in the labels).. – Arun Apr 17 '17 at 20:27
  • @Arun Thank you for your explanations as well as for your suggestion concerning the log scales. – Uwe Apr 18 '17 at 22:18

Adapted your data to use actual dates. In the data, 2017-01-04 and 2017-01-06 are not in the date table. This approach generates a sequence from each company's first to last date; on compid 2 you can see the missing dates being filled in. `seq.Date(from =, to =, by = 1)` generates the candidate dates.

The unnest probably creates a large data frame, so there is some risk of running out of memory, but if you keep these tables down to just the compid and date_crsp columns it may fit.
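
A rough way to gauge that size up front (just a sketch; it only needs the compid and date_crsp columns):

library(dplyr)
# number of rows unnest() will produce: one row per calendar day
# between each company's first and last trading date
crsp_td_net %>%
  group_by(compid) %>%
  summarize(span = as.integer(max(date_crsp) - min(date_crsp)) + 1L) %>%
  summarize(total_rows = sum(span))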

semi_join and inner_join should both work; you will want to test which is faster (a sketch of the inner_join variant appears after the output below).

 zz <- "compid  date_crsp 
1          1        2017-01-02
2          1        2017-01-03     
3          1        2017-01-05   
4          2        2017-01-03  
5          2        2017-01-07 
6          2        2017-01-09 
7          3        2017-01-03
8          3        2017-01-05
9          3        2017-01-07
10         3        2017-01-08"
crsp_td_net <- read.table(text=zz, header = TRUE)
library(lubridate)
crsp_td_net$date_crsp <- ymd(crsp_td_net$date_crsp)

xx <- "date_crsp 
1 2017-01-02 
2 2017-01-03
3 2017-01-05
4 2017-01-07
5 2017-01-08
6 2017-01-09
7 2017-01-10"

ff_35f <- read.table(text=xx, header = TRUE)
ff_35f$date_crsp <- ymd(ff_35f$date_crsp)

    library(dplyr)
    library(tidyr)

    crsp_td_net_summary <- crsp_td_net %>%
      group_by(compid) %>%
      summarize(date_crsp = list(seq.Date(from=min(date_crsp), to=max(date_crsp), by=1))) %>%
      unnest() %>%
      semi_join(ff_35f, by="date_crsp") %>%
      arrange(compid, date_crsp)

    crsp_td_net_summary

    # # A tibble: 12 × 2
    # compid  date_crsp
    # <int>     <date>
    # 1       1 2017-01-02
    # 2       1 2017-01-03
    # 3       1 2017-01-05
    # 4       2 2017-01-03
    # 5       2 2017-01-05
    # 6       2 2017-01-07
    # 7       2 2017-01-08
    # 8       2 2017-01-09
    # 9       3 2017-01-03
    # 10      3 2017-01-05
    # 11      3 2017-01-07
    # 12      3 2017-01-08
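
For the speed comparison mentioned above, this is a sketch of the inner_join variant (since ff_35f only has the date_crsp column here, the rows returned should match the semi_join):

crsp_td_net_summary2 <- crsp_td_net %>%
  group_by(compid) %>%
  summarize(date_crsp = list(seq.Date(from = min(date_crsp), to = max(date_crsp), by = 1))) %>%
  unnest() %>%
  inner_join(ff_35f, by = "date_crsp") %>%
  arrange(compid, date_crsp)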
  • thanks. I could not get it to work on my actual data set - it threw the following error message: "Error in eval(substitute(expr), envir, enclos) : exactly two of 'to', 'by' and 'length.out' / 'along.with' must be specified" My data sets are as the above reproducible example, but date_crsp is an actual date variables in my data sets. I wonder if that could be the problem? – Dev V Apr 17 '17 at 01:50
  • to clarify what I am trying to do. Every company in my data set has trading days, but these are not complete. If a company does not trade on one perfectly normal trading day, I want to make sure that missing date is part of the trading days data for that company. In other words, I need to expand the data for each company to fill in these missing trading days between the first day the company starts trading and the last day I observe it trading (or last date of the data download). Hope that makes sense. Thanks. – Dev V Apr 17 '17 at 02:03
  • is every day a valid trading day with no gaps all? – Andrew Lavers Apr 17 '17 at 02:05
  • not every day is, so we can't use calendar days for filling these in. The second file (ff_35f) has all the actual trading days and I was trying to match with the latter. – Dev V Apr 17 '17 at 02:10
  • Thank you for the revised code. These now work on my actual data set. For info, in terms of speed, your codes executed in 11.92 (elapsed) v/s 1.12 (elapsed) for the foverlaps() solution proposed above.Thank you both for very elegant solutions to my problem. – Dev V Apr 17 '17 at 10:37