
I have a large set of dataframes (around 50,000). Each dataframe has two columns, key and value, with around 100-200 rows. My question is essentially similar to this and this. Following their ideas, I construct a list of dataframes and use the Reduce function:

# Full outer join of every dataframe in freq_list on "key"
freq_matrix <- Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "key", all = TRUE),
                      freq_list)

But my code has been running for several days. Is there a more efficient, faster way to merge such a large set of dataframes?
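
For context, freq_list is assumed here to be a plain list of two-column dataframes built roughly along these lines (the directory name and the read.csv() call are placeholders, not the actual loading code):

# Hypothetical construction of freq_list: one (key, value) dataframe per input file.
# "data" and read.csv() stand in for the real source and reader.
files <- list.files("data", full.names = TRUE)
freq_list <- lapply(files, function(f) {
  read.csv(f, col.names = c("key", "value"), stringsAsFactors = FALSE)
})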

Ding Li
  • If you only have a key and value in each dataframe, then do you actually need to merge on the key? Can you not just concatenate the dataframes? Or is the value in each dataframe a different type of measurement? – Marius May 05 '17 at 02:09
  • Each dataframe may have overlapping keys. More importantly, each dataframe represents one file, so I need the columns (from column 2 onwards) in the merged dataframe to represent the original files. Thanks. – Ding Li May 05 '17 at 02:17
  • How long does your code run to combine two data.frames? – Adam Quek May 05 '17 at 02:27
  • What are the sizes of your dataframes? – yeedle May 05 '17 at 02:39
  • @DingLi: Concatenation might still be an option, e.g. for each individual dataframe add a column `df$source = current_source`. Then concatenate all the individual frames using something like `bind_rows()`, and if needed reshape at the end (see the sketch after these comments). It may be faster than merging, as there's less need to look up keys every time, but it's hard to say without an idea of the data. – Marius May 05 '17 at 02:53
  • Just an idea: why don't you merge the files _before_ loading them into R? If you are under Linux/Unix you could concatenate them, then remove duplicates in a second step. Have you tried that? – Umberto May 05 '17 at 05:58
  • @AdamQuek it is pretty fast to combine just two dataframes. Within 1s. – Ding Li May 06 '17 at 08:47
  • @yeedle the sizes of the individual dataframes vary. The total size of all dataframes is about 400 MB. – Ding Li May 06 '17 at 08:49
  • @Marius Thanks for this excellent suggestion. I will try it out. – Ding Li May 06 '17 at 08:51
  • @fasttouch I don't quite understand your idea. It is actually a text-mining task on a large set of documents, so in the final dataframe I need the columns to represent documents and the rows to represent the word list. I don't think merging the files as a first step is what I want. – Ding Li May 06 '17 at 08:58
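
A minimal sketch of the concatenate-then-reshape idea from the comment above, assuming dplyr and tidyr are available, that freq_list is the list of dataframes from the question, and that keys are unique within each individual dataframe (the source labels here are just the list positions, not real file names):

library(dplyr)
library(tidyr)

# Stack all dataframes into one long table, tagging each row with the
# index of the dataframe it came from.
long <- bind_rows(freq_list, .id = "source")

# Reshape wide: one row per key, one column per source dataframe,
# NA where a key does not occur in a given source.
freq_matrix <- pivot_wider(long, names_from = source, values_from = value)

Compared with repeated merge() calls, this does a single stack plus one reshape, which tends to scale better when there are tens of thousands of small tables.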

1 Answer


This way is pretty fast. First of all, I created 500 tables, each containing 150 key-value pairs:

library(data.table)
library(stringi)

# Generate 500 example dataframes with 150 (key, value) rows each,
# assigned to variables named df_1 ... df_500.
for (i in 1:500) {
  set.seed(i)
  dfNam <- paste('df', i, sep = '_')
  df <- data.frame(cbind(
    key   = tolower(stri_rand_strings(150, 1, pattern = '[A-Za-z]')),
    value = sample(1:1000, 150, replace = TRUE)
  ))
  assign(dfNam, df)
  rm(df)
  rm(dfNam)
}

Then I transposed and appended them:

tmp <- data.table()
for (i in ls(pattern = 'df_')) {
  df <- get(i)
  # Transpose so that the first row holds the keys and the second the values
  dt <- data.table(transpose(df))
  # Use the keys as column names, then drop that row, leaving one row of values
  colnames(dt) <- as.character(unlist(dt[1, ]))
  dt <- dt[-1, ]
  # Append the one-row table; fill = TRUE adds NA columns for keys not seen before
  tmp <- rbindlist(list(tmp, dt), use.names = TRUE, fill = TRUE)
}

And transposed everything back at the end:

# Transpose back so rows are keys again and columns are the original dataframes;
# colnames(tmp) holds the key names, which become the first column
merged_data <- transpose(tmp)
key <- colnames(tmp)
merged_data <- cbind(key, merged_data)

Works like a charm.
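
A quick sanity check on the result (just an inspection step; the exact dimensions depend on the random keys generated above):

dim(merged_data)                            # one row per distinct key, one column per dataframe plus the key column
head(as.data.frame(merged_data)[, 1:5])     # peek at the first few keys and dataframes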

sanyi