I have one huge dataframe (400k rows by 8000 cols), another dataframe that's 400k rows by 3 cols, and a third dataframe that's 402k rows by 3 cols. Each df has 2 key columns. Label them:
dfBig
- the biggest df (400k rows x 8000 cols)
dfSmall1
- the df with 402k rows
dfSmall2
- the df with 400k rows (same row count as dfBig)
When I try to compile all the information with the following dplyr solution, my RStudio session gets aborted:
dfCompile <- dfBig %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>%
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())
I can't even run dfBig %>% arrange(Key1, Key2)
without it blowing up. What I've done instead to make it work is:
# Grab only the keys to cut down on size
dfInfo <- dfBig %>% select(Key1, Key2)
# Merge other dfs to the info DF
dfInfo <- dfInfo %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>%
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(-Key1, -Key2)
# cbind the new columns back onto dfBig since I know the rows are in the same order (arrange(.) blows it up)
dfBig <- cbind(dfBig, dfInfo) %>%
  select(Key1, Key2, newCol1, newCol2, everything())
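Since this relies on the rows staying aligned, a minimal check I could run before the select(-Key1, -Key2) step (just a sketch) would be to assert that the keys still match row for row, which would also catch a join that accidentally duplicates rows:
# Assumes Key1/Key2 have not yet been dropped from dfInfo
stopifnot(identical(dfBig$Key1, dfInfo$Key1),
          identical(dfBig$Key2, dfInfo$Key2))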
It works, but it doesn't feel very elegant, and I don't want to have to mess with it again if it breaks down on an even bigger dataframe. Can sqldf or data.table handle this better? I just want to understand the best path to go down before I start learning for myself.