
I have a data frame consisting of approx. 20 million rows. The table looks as follows:

Output_I
    cusip_id    price
    uidiso      100.5
    undnns      90.2
    xsodeid     45.5
    uidiso      99.5
    xsodeid     45.1
    undnns      90.0

Now I have a second data frame consisting of the cusip_id plus additional information:

output_II
    cusip_id    ISIN
    uidiso      xs987346325
    undnns      ch438763282
    xsodeid     xs937349494

I'd like to merge output_I with output_II based on cusip_id in order to get the following results:

    output_III
    cusip_id    price   ISIN
    uidiso      100.5   xs987346325
    undnns      90.2    ch438763282
    xsodeid     45.5    xs937349494
    uidiso      99.5    xs987346325
    xsodeid     45.1    xs937349494
    undnns      90.0    ch438763282

The challenge is the size of output_I, which consists of approx. 20 million rows. I have tried the following code:

library(dplyr)
output_III <- left_join(output_I, output_II, by = "cusip_id")


library(data.table)
ldt <- data.table::data.table(output_I)
rdt <- data.table::data.table(output_II, key = c("cusip_id", "ISIN"))
output_III <- merge(ldt, rdt)
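For reference, here is a minimal self-contained sketch of the `left_join` approach, using small toy versions of the two tables above to show that it produces the desired `output_III` (the data values are copied from the question; the approach scales to the full 20 million rows provided memory allows):

```r
library(dplyr)

# Toy versions of the two tables from the question
output_I <- data.frame(
  cusip_id = c("uidiso", "undnns", "xsodeid", "uidiso", "xsodeid", "undnns"),
  price    = c(100.5, 90.2, 45.5, 99.5, 45.1, 90.0),
  stringsAsFactors = FALSE
)
output_II <- data.frame(
  cusip_id = c("uidiso", "undnns", "xsodeid"),
  ISIN     = c("xs987346325", "ch438763282", "xs937349494"),
  stringsAsFactors = FALSE
)

# Left join keeps every row of output_I and attaches the matching ISIN
output_III <- left_join(output_I, output_II, by = "cusip_id")
```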

Is there perhaps a way to use dplyr and group_by?

Or is this not possible with such a big data frame?

I appreciate your feedback.

    I don't know why you think you would need `group_by`. Your `left_join` and `merge` code both look just fine. 20MM lines won't be a limitation for R (though possibly for your computer). Why are you not happy with those results? I'd also recommend making sure `cusip_id` is a `character` class in both data frames. Also make sure you are not doing a cross join (unless you intend to be). See that `cusip_id` isn't duplicated in at least one of your data frames. – Gregor Thomas Apr 30 '18 at 14:08
  • The canonical dupe for this is the R-FAQ [How to merge data in R](https://stackoverflow.com/q/1299871/903061), but since you seem to be aware of those solutions I won't close as a duplicate until you've had a chance to describe what didn't work. – Gregor Thomas Apr 30 '18 at 14:09
  • 3
    @PaulEndymion but why would you `match` instead of the highly-optimized `data.table::merge` or `dplyr::left_join` functions? – Gregor Thomas Apr 30 '18 at 14:18
  • Thank you. The problem is the resources of my computer. It says: `Error: cannot allocate vector of size`. How would the highly-optimized `data.table::merge` work? – Brian Apr 30 '18 at 14:23
  • Using data.table: first convert your dataframe to datatables with `setDT()`:`setDT(output_I)` & `setDT(output_II)` then you can join with `output_I[output_II, on = .(cusip_id), ISIN := ISIN][]` – Jaap Apr 30 '18 at 14:28
  • And also clear out any other objects in your workspace. In the code you have, you start by duplicating your data: `ldt = data.table(output_I)`. Now you have two copies of `output_I` eating up twice the memory. Use `rm()` to get rid of objects you don't actually need. Try to avoid making unnecessary copies. Jaap's suggested `setDT` code is good for converting things to `data.table` without making a copy. – Gregor Thomas Apr 30 '18 at 14:37
  • Additionally: [See also this answer](https://stackoverflow.com/a/34600831/2204410) about memory efficient left join with data.table I posted a while ago. – Jaap Apr 30 '18 at 14:50
  • @Jaap: great explanation. Additionally, I had to remove duplicates in output_II in order to smoothly run the code (it did not work with duplicates in output_II). I did construct the code as you propose and it runs. Great! – Brian Apr 30 '18 at 15:12
  • @Gregor I did learn something today – Paul Endymion Apr 30 '18 at 16:18
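Putting the comments together, a memory-efficient version of the join converts both data frames to data.tables in place with `setDT()` (no copies) and then adds the `ISIN` column to `output_I` by reference with an update join, as Jaap suggests. A minimal sketch, assuming `output_II` has one row per `cusip_id` (Brian notes duplicates had to be removed first); the `i.ISIN` prefix, which explicitly refers to the column from the joined table, is a slight variation on Jaap's `ISIN := ISIN`:

```r
library(data.table)

output_I <- data.frame(
  cusip_id = c("uidiso", "undnns", "xsodeid", "uidiso", "xsodeid", "undnns"),
  price    = c(100.5, 90.2, 45.5, 99.5, 45.1, 90.0)
)
output_II <- data.frame(
  cusip_id = c("uidiso", "undnns", "xsodeid"),
  ISIN     = c("xs987346325", "ch438763282", "xs937349494")
)

# Convert in place -- unlike data.table(output_I), this makes no copy
setDT(output_I)
setDT(output_II)

# Ensure one row per key in the lookup table before the update join
output_II <- unique(output_II, by = "cusip_id")

# Update join: adds ISIN to output_I by reference, without
# materialising an intermediate merged table
output_I[output_II, on = .(cusip_id), ISIN := i.ISIN]
```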

0 Answers