
I need to merge about 19000 data frames in R by their dates, and then merge the result with one more data frame afterwards. I need the data in separate columns so that I can loop over the 19000 data frames and fit a regression model for each one.

I tried using the merge command, but all of the data frames share the same column names, "DATE" and "RET", and merge can only disambiguate duplicates by appending the .x and .y suffixes, so after a couple of merges the suffixed names themselves collide. I want the RET columns to be numbered by the dataset they come from. For example, I want there to be "DATE" in the first column, and then "RET1", "RET2", "RET3", and so forth up to 19000 or so. I've also tried to figure out how to rename all the columns by numbering them RET1, RET2, and so forth, but I could not find a solution to that either.

This is my failed attempt to rename the columns:

for (j in 1:19938){
  colnames(HPR_Split_New[[j]]) <- c("DATE",j)
}
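
A minimal fix for this loop, assuming HPR_Split_New is a list of two-column data frames: build the new name with paste0(), since the loop above assigns the bare number j as the column name instead of RET1, RET2, and so on.

# Rename the second column of each data frame to RET1, RET2, ...
# (paste0() builds the "RET" prefix the original loop left out).
for (j in seq_along(HPR_Split_New)) {
  colnames(HPR_Split_New[[j]]) <- c("DATE", paste0("RET", j))
}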

This is my failed attempt to merge all the data frames together:

merged.data.frame = Reduce(function(...) merge(..., by=c('DATE'), all=T), HPR_Split_New)

For the renaming failure, I expected the column names to be renamed to RET1, RET2, and so on, but the code did nothing at all.

For the merging failure, it warned that too many column names were duplicated and it couldn't merge any further. Here is one of the warnings:

47: In merge.data.frame(..., by = c("DATE"), all = T) :
  column names ‘RET.x’, ‘RET.y’, ‘RET.x’, ‘RET.y’, ‘RET.x’, ‘RET.y’, … are duplicated in the result
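
For reference, once every RET column has a unique name (as in the renaming sketch above), the same Reduce pattern should no longer produce duplicate .x/.y names, though pairwise-merging 19000 data frames this way will still be slow:

# With unique column names (RET1, RET2, ...), merge() no longer needs
# to invent .x/.y suffixes, so the duplicate-name warning goes away.
merged.data.frame <- Reduce(function(x, y) merge(x, y, by = "DATE", all = TRUE),
                            HPR_Split_New)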

Please help me. I haven't used R before and I've been stuck on combining these data frames for over a week.

  • Sounds like it would make more sense to stack them all in one big long dataset with a `source` column indicating the dataset the rows came from. You could always reshape to wide format later if that was required for output. – thelatemail May 31 '19 at 00:16
  • Your edit doesn't change my comment. You can very easily rearrange a long file to a wide file in a single call without trying to do 19000 merge operations. – thelatemail May 31 '19 at 00:24
  • Hey, thanks for responding. I need to have them in the wide format in order to do regression modelling. I've updated the initial post. I also have a version of the dataset that's all together in 3 columns, but I do not know how I could run a looped linear regression with the data in that format, or how to reshape it to a wide format from there. Could you please tell me how? – Jonathan Fung May 31 '19 at 00:25
  • If you search stackoverflow for "reshape long to wide" this has been discussed at length here, e.g.: https://stackoverflow.com/questions/5890584/how-to-reshape-data-from-long-to-wide-format It may even be possible to run a regression using the long form data since your `source` column will just be a grouping variable. – thelatemail May 31 '19 at 00:26
  • Thanks so much for your comment. The function is exactly what I was looking for but didn't know existed. It's still processing after 2 hours though. – Jonathan Fung May 31 '19 at 03:14
  • Hi, just wanted to say that the reshape() function didn't work well because I had 19k columns and it took four and a half hours. If you install the reshape2 package, however, you can use the dcast() function, and it reshaped the dataset in only a minute (both suggestions are sketched below). Thanks again for the help! – Jonathan Fung Jun 01 '19 at 15:21
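
A minimal sketch of the stack-then-reshape approach suggested in the comments, assuming HPR_Split_New is a list of data frames with columns DATE and RET, and that the data.table package (whose dcast() works like reshape2's) is installed:

library(data.table)

# Stack all data frames into one long table; idcol records which list
# element (1, 2, ..., 19000) each row came from.
long <- rbindlist(HPR_Split_New, idcol = "id")

# Build the target column names as a factor so RET1, RET2, ... come out
# in numeric rather than alphabetical order after casting.
long[, var := factor(paste0("RET", id),
                     levels = paste0("RET", seq_along(HPR_Split_New)))]

# Reshape long -> wide: one row per DATE, one RET column per data frame.
wide <- dcast(long, DATE ~ var, value.var = "RET")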
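
And a hypothetical sketch of running the regressions on the long table from the previous sketch, as thelatemail suggests: here `factors` and its column MKT stand in for the extra data frame and regressor mentioned in the question (both names are placeholders, not from the original post).

# Merge the placeholder regressor onto the long table by date, then fit
# one lm() per original data frame, using id as the grouping variable.
long2  <- merge(long, factors, by = "DATE")
models <- lapply(split(long2, long2$id), function(d) lm(RET ~ MKT, data = d))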

0 Answers