1

I have three large data frames I want to merge using left_join from the dplyr library. The data frames share the same structure, they have 9859 observations and 4 variables and only the fourth differs. In the end, what I need is a data frame with six columns (3 shared variables and the 3 differing variables).

I have been able to do this in the past using 'tidyverse' and ‘dplyr’ but now RStudio keeps crashing and freezing my laptop. In addition, I can use left_join with the first two data frames, but not when joining the third.

I was able to perform 'left_join' with multiple data frames and I think this might have something to do with my data, but I cannot figure out what. You can download file with the three data frames here

sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

The data frames seem to have the same structure

str(df.u)
'data.frame':   9859 obs. of  4 variables:
 $ event_id : num  3.51e+09 3.51e+09 3.51e+09 3.51e+09 3.51e+09 ...
 $ timestamp: POSIXct, format: "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" ...
 $ raster.id: chr  "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" ...
 $ u_wind   : num  -1.28 -1.3 -1.31 -1.32 -1.32 ...
str(df.v)
'data.frame':   9859 obs. of  4 variables:
 $ event_id : num  3.51e+09 3.51e+09 3.51e+09 3.51e+09 3.51e+09 ...
 $ timestamp: POSIXct, format: "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" ...
 $ raster.id: chr  "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" ...
 $ v_wind   : num  -1.52 -1.53 -1.53 -1.54 -1.54 ...
str(df.w)
'data.frame':   9859 obs. of  4 variables:
 $ event_id : num  3.51e+09 3.51e+09 3.51e+09 3.51e+09 3.51e+09 ...
 $ timestamp: POSIXct, format: "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" ...
 $ raster.id: chr  "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" ...
 $ w_wind   : num  -0.02343 -0.00834 0.00273 0.01357 0.01842 ...

Here is the code I've tried so far, what works and what crashes:

library(tidyverse)
# this craches 
dat.wind <- left_join(df.u, df.v, by=c('event_id', 'timestamp', 'raster.id')) %>% left_join(dat.wind, df.w, by=c('event_id', 'timestamp', 'raster.id'))

If I make it in two steps, the first won't crash, but the second will:

dat.wind <- left_join(df.u, df.v, by=c('event_id', 'timestamp', 'raster.id')) # doesn't crash
dat.wind2 <- left_join(dat.wind, df.v, by=c('event_id', 'timestamp', 'raster.id')) # crashes

I've also tried converting into a list and use the solution suggested by Paul Rougieux here

list(df.u, df.f, df.w) %>% reduce(left_join, by=c('event_id', 'timestamp', 'raster.id')) # also crahses

In this specific case, I can simply use data frame function to get the desired result, but this is part of a loop and can get more complex.

dat.wind <- data.frame('event_id' = df.u$event_id, 'timestamp' = df.u$timestamp, 'raster.id' = df.u$raster.id, 'u_wind' = df.u$u_wind, 'v_wind' = df.v$v_wind, 'w_wind' = df.w$w_wind)
# this is what I want
head(dat.wind)
    event_id           timestamp           raster.id    u_wind    v_wind       w_wind
1 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.277772 -1.520014 -0.023433736
2 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.295119 -1.526865 -0.008342839
3 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.305293 -1.531078  0.002726094
4 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.317489 -1.535781  0.013570182
5 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.324802 -1.538454  0.018419913
6 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.326861 -1.539239  0.019975858
onlyphantom
  • 8,606
  • 4
  • 44
  • 58
FAmorim
  • 300
  • 2
  • 14
  • 3
    The `by` in the left join in all 3 of your data frames are identical. This means that there will be a 9859 matches for 9859 rows (essentially n^2). You are likely crashing due to memory issues. – Jake Kaupp Jul 20 '18 at 14:08
  • Thank you for your comment, it is now much clear how `left_join` works. – FAmorim Jul 21 '18 at 22:10

2 Answers2

1

You could try:

jdfs<-join_all(list(dfs), by = "", type = 'left', match = "first")

Salma Guzmán
  • 13
  • 1
  • 4
0

To expand on Jake's comment, this is not really a use case for left_join because there is no ID column that differs for each row. If you try to do this match, what will happen is that each row in the first data frame will be matched to 9859 others, and then each of the 97,199,881 rows will try to match with 9859 in the third match. Instead you could use bind_cols or mutate to just copy the columns if you want to keep using the pipe. Or, if you really want to use left_join, you can add a row id column to each data frame to make each row distinct and remove it afterwards:

Of course, if it is not expected that the first three columns only have one distinct value, then check your data source!

df.u %>%
  bind_cols(df.v['v_wind'], df.w['w_wind'])
df.u %>%
  mutate(v_wind = df.v$v_wind, w_wind = df.w$w_wind)
df.u %>%
  rowid_to_column() %>%
  left_join(rowid_to_column(df.v), by = c('rowid', 'event_id', 'timestamp', 'raster.id')) %>%
  left_join(rowid_to_column(df.w), by = c('rowid', 'event_id', 'timestamp', 'raster.id')) %>%
  select(-rowid)
Calum You
  • 14,687
  • 4
  • 23
  • 42
  • I can see that using `bind_cols` is much more adequate fot my purpose. Thank you for pointing this out! – FAmorim Jul 21 '18 at 22:13