I have three large data frames I want to merge using left_join
from the dplyr
library. The data frames share the same structure, they have 9859 observations and 4 variables and only the fourth differs. In the end, what I need is a data frame with six columns (3 shared variables and the 3 differing variables).
I have been able to do this in the past using 'tidyverse' and ‘dplyr’ but now RStudio keeps crashing and freezing my laptop. In addition, I can use left_join
with the first two data frames, but not when joining the third.
I was able to perform 'left_join' with multiple data frames and I think this might have something to do with my data, but I cannot figure out what. You can download file with the three data frames here
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
The data frames seem to have the same structure
str(df.u)
'data.frame': 9859 obs. of 4 variables:
$ event_id : num 3.51e+09 3.51e+09 3.51e+09 3.51e+09 3.51e+09 ...
$ timestamp: POSIXct, format: "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" ...
$ raster.id: chr "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" ...
$ u_wind : num -1.28 -1.3 -1.31 -1.32 -1.32 ...
str(df.v)
'data.frame': 9859 obs. of 4 variables:
$ event_id : num 3.51e+09 3.51e+09 3.51e+09 3.51e+09 3.51e+09 ...
$ timestamp: POSIXct, format: "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" ...
$ raster.id: chr "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" ...
$ v_wind : num -1.52 -1.53 -1.53 -1.54 -1.54 ...
str(df.w)
'data.frame': 9859 obs. of 4 variables:
$ event_id : num 3.51e+09 3.51e+09 3.51e+09 3.51e+09 3.51e+09 ...
$ timestamp: POSIXct, format: "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" "2017-08-08 20:38:37" ...
$ raster.id: chr "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" "2017-08-08_20_40_10" ...
$ w_wind : num -0.02343 -0.00834 0.00273 0.01357 0.01842 ...
Here is the code I've tried so far, what works and what crashes:
library(tidyverse)
# this craches
dat.wind <- left_join(df.u, df.v, by=c('event_id', 'timestamp', 'raster.id')) %>% left_join(dat.wind, df.w, by=c('event_id', 'timestamp', 'raster.id'))
If I make it in two steps, the first won't crash, but the second will:
dat.wind <- left_join(df.u, df.v, by=c('event_id', 'timestamp', 'raster.id')) # doesn't crash
dat.wind2 <- left_join(dat.wind, df.v, by=c('event_id', 'timestamp', 'raster.id')) # crashes
I've also tried converting into a list and use the solution suggested by Paul Rougieux here
list(df.u, df.f, df.w) %>% reduce(left_join, by=c('event_id', 'timestamp', 'raster.id')) # also crahses
In this specific case, I can simply use data frame function to get the desired result, but this is part of a loop and can get more complex.
dat.wind <- data.frame('event_id' = df.u$event_id, 'timestamp' = df.u$timestamp, 'raster.id' = df.u$raster.id, 'u_wind' = df.u$u_wind, 'v_wind' = df.v$v_wind, 'w_wind' = df.w$w_wind)
# this is what I want
head(dat.wind)
event_id timestamp raster.id u_wind v_wind w_wind
1 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.277772 -1.520014 -0.023433736
2 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.295119 -1.526865 -0.008342839
3 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.305293 -1.531078 0.002726094
4 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.317489 -1.535781 0.013570182
5 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.324802 -1.538454 0.018419913
6 3512002602 2017-08-08 20:38:37 2017-08-08_20_40_10 -1.326861 -1.539239 0.019975858