
I have one huge dataframe (400k rows by 8000 cols), another dataframe that's 400k rows by 3 cols, and a third dataframe that's 402k rows by 3 cols. Each df has 2 key columns. Label them:

dfBig - biggest df

dfSmall1 - the df with 402k rows

dfSmall2 - the df with the same number of rows as dfBig (400k)

When I try to compile all the information with the following dplyr solution, my RStudio session gets aborted:

library(dplyr)

dfCompile <- dfBig %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>% 
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

I can't even run dfBig %>% arrange(Key1, Key2) without it blowing up. What I've done instead to make it work is:

# Grab only the keys to cut down on size
dfInfo <- dfBig %>% select(Key1, Key2)

# Merge other dfs to the info DF
dfInfo <- dfInfo %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>% 
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>% 
  select(-Key1, -Key2)

# Cbind the rest of the columns back in since I know they'll be in the same order (arrange(.) blows it up)
dfBig <- cbind(dfBig, dfInfo) %>% 
  select(Key1, Key2, newCol1, newCol2, everything())

It works, but it doesn't feel very elegant, and I don't want to have to mess with it again if it breaks down with an even bigger dataframe. Can sqldf or data.table handle this better? I just want to understand the best path to go down before I start learning for myself.

CoolGuyHasChillDay
  • I think `data.table` would be faster. Check out https://stackoverflow.com/questions/34598139/left-join-using-data-table/34600831 – Ronak Shah Jun 05 '20 at 00:25
  • Thanks, sorry for the dumb question but the notation has always sort of confused me. R will just know that I want to merge via `data.table` when I use brackets w/two dataframes? – CoolGuyHasChillDay Jun 05 '20 at 00:29
  • 1
    You need to convert dataframe to data.table first. Do it via `setDT(dfBig)`, `setDT(dfSmall1)` and `setDT(dfSmall2)` and then you can use the above link to join them. – Ronak Shah Jun 05 '20 at 00:45
  • 1
    This actually feels like a pretty solid workaround to me, if all other methods fail. – dpritch Jun 05 '20 at 14:40
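For reference, the keyed bracket-join syntax those links describe looks roughly like this: a minimal sketch using the question's names, and assuming dfSmall1 and dfSmall2 each contribute new columns (newCol1 and newCol2 in the question):

library(data.table)

# setDT() converts in place, so no copy of the 400k x 8000 table is made
setDT(dfBig); setDT(dfSmall1); setDT(dfSmall2)

# In data.table's X[Y] join syntax, dfSmall1[dfBig, on = ...] keeps every
# row of dfBig, i.e. it performs a left join of dfSmall1 onto dfBig
dfCompile <- dfSmall1[dfBig, on = c('Key1', 'Key2')]
dfCompile <- dfSmall2[dfCompile, on = c('Key1', 'Key2')]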

2 Answers


I believe using data.frame() would be fastest, but it will only work properly when there is a complete intersection between the two data frames you are joining. You just have to give it a key =, as in the example below. This might be what you want, since your left joins keep only dfBig's rows, dropping at least 2k unmatched rows of the 402k-row dfSmall1 (assuming your keys are unique):

dfCompile <- dfBig %>%
  data.frame(dfSmall1, key = c('Key1', 'Key2')) %>% 
  data.frame(dfSmall2, key = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

Otherwise, using the data.table R package is probably the way to go.

library(data.table)
library(dplyr)  # for %>% and select()
setDT(dfBig)
setDT(dfSmall1)
setDT(dfSmall2)
dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2'), all = TRUE) %>% 
  merge(dfSmall2, by = c('Key1', 'Key2'), all = TRUE) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

Note: if you use the data.table approach and want a left join, as in your example, change all = TRUE to all.x = TRUE in the merge() calls.
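A minimal sketch of that left-join variant, continuing from the setDT() calls above, so every row of dfBig is kept just as in the question's dplyr chain:

# Same chain as above, but all.x = TRUE keeps every row of dfBig
dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2'), all.x = TRUE) %>% 
  merge(dfSmall2, by = c('Key1', 'Key2'), all.x = TRUE) %>%
  select(Key1, Key2, newCol1, newCol2, everything())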

I hope this helps!

Joshua Mire
  • Can you explain what you are trying to do with the line `dfBig %>% data.frame(dfSmall1, key = c('Key1', 'Key2'))`? – Ronak Shah Jun 05 '20 at 00:49
  • Hey thanks that seems to work well. Side question, but since I'm not actually getting an error - just a "RStudio is terminated" due to the size - would my method work if I ran it through the terminal? I know it would take a while but I really just want it to work, I don't care if it has to run overnight. – CoolGuyHasChillDay Jun 05 '20 at 02:07
  • If it did not run in RStudio, I don't think it will work in R. However, since your initial question revolved around the fastest solution, I would advise you work in R as your scripts will run faster than in RStudio. Best wishes! – Joshua Mire Jun 05 '20 at 02:10
  • What is your rationale for the statement "I would advise you work in R as your scripts will run faster than in RStudio?" – dpritch Jun 05 '20 at 03:11
  • If speed is all OP cares about, R will be faster. RStudio runs on top of R. I should probably change my phrasing away from 'advise', as I personally still work in RStudio when writing my code, even if it's to be auto-run in R. – Joshua Mire Jun 05 '20 at 03:17
  • How long do you think that it takes for RStudio to submit an expression to the R interpreter, a couple of microseconds? And are you suggesting that this speed is slower than other interfaces with R such as if you ran R through a terminal? Once the expression is submitted it should run at the same speed whatever the interface. I think it is mistaken to suggest to people to avoid RStudio if they want their code to run faster, unless you have evidence to the contrary. – dpritch Jun 05 '20 at 14:35

It sounds like the size of the data is too large to fit into memory and R is crashing. Here are a few ideas to work around it.

  1. It sounds like you've tried dplyr::left_join and data.table::merge; one other option would be to try base::merge (see the sketch after this list), although admittedly it's a long shot if the others didn't work.

  2. Do you need all of the columns in your data -- can you remove some unneeded ones before performing the merge?

  3. Another approach you could take is to add swap space to your system. You can make your swap space as large as you want, but your application may crawl to a near-halt if too much swap is required.

  4. You could try a package like disk.frame, which performs the operation one chunk at a time. See https://diskframe.com/.

  5. Can you find a machine with more memory? You could rent a machine on a cloud provider such as AWS, GCP, Azure, or elsewhere.
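A minimal sketch of option 1, reusing the question's names; as in the data.table answer, all.x = TRUE makes each merge() a left join that keeps every row of dfBig:

# Base R left joins: all.x = TRUE keeps all rows of the first argument
dfCompile <- merge(dfBig, dfSmall1, by = c('Key1', 'Key2'), all.x = TRUE)
dfCompile <- merge(dfCompile, dfSmall2, by = c('Key1', 'Key2'), all.x = TRUE)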

dpritch
  • 1
    Reducing the column size is definitely the key, I'm just in the exploratory phase of a new project and I really need to look at it all (cut down the 8k features from 40k). I could do it batches though, I should have just done that but I'm too stubborn to not do it all in one shot. Ok thanks I'll take a look at these, especially #5. That's been recommended to me before for other projects, I just need to learn the ins and outs on how to set it up. – CoolGuyHasChillDay Jun 05 '20 at 05:20