
My RAM is 4 GB and I am using 64-bit Windows and R. I want to merge around 25 data frames on the basis of a common key column (Date). I have searched the internet and various Stack Overflow threads, but I keep getting the error "Error: cannot allocate vector of size x Gb". The data frames look like this:

A <- data_frame(Date = c("1/1/1990", "1/2/1990", "1/3/1990"), MXAR = c(342.369, 342.369, 525.528))
B <- data_frame(Date = c("11/30/2009", "12/1/2009", "12/2/2009"), MXBD = c(1000, 1007.13, 998.55))
C <- data_frame(Date = c("6/1/2005", "6/2/2005", "6/6/2005"), MXBH = c(1209.21, 1205.74, 1195.33))

I have around 25 data frames like A, B and C (each with around 3000 rows), and each of them shows NA's in its last rows. I have tried the Stack Overflow solutions shown below, but they keep throwing errors. Kindly help.

filenames = list.files(path = path, full.names = TRUE)  # files are of the type of A, B and C discussed above

library(readxl)
library(dplyr)

datalist = lapply(filenames, function(x) { read_excel(x, sheet = 1) })
df6 <- datalist %>% Reduce(function(dtf1, dtf2) inner_join(dtf1, dtf2, by = "Date"), .)

and

file_list <- list.files()

for (file in file_list) {
  # if the merged dataset already exists, append to it
  if (exists("dataset")) {
    temp_dataset <- read_excel(file)
    dataset <- merge.data.frame(dataset, temp_dataset, by = c("Date"))
    rm(temp_dataset)
  }
  # if the merged dataset doesn't exist, create it
  if (!exists("dataset")) {
    dataset <- read_excel(file)
  }
}

and

reshape::merge_all(datalist, by="Date")

When running

datalist = lapply(filenames, function(x){data.table(read_excel(x,sheet=1), key="Date")})

The program throws many warnings like

In read_xlsx_(path, sheet, col_names = col_names, col_types = col_types,             ... :
  [1601, 1]: expecting date: got '[NULL]'

As mentioned, the A, B and C data frames show NA's in their last rows. Is there a way to remove these NA's in the above lines of code before running the merge? Maybe that would remove the errors.

gaurav kumar
  • Could be a duplicate of http://stackoverflow.com/questions/11146967/efficient-alternatives-to-merge-for-larger-data-frames-r – mpjdem Jan 25 '17 at 11:12
  • If that does not help, I would suggest splitting the data frames by date, merging these subsets separately and, if needed and possible, `rbind`-ing them at the end into a big frame. – mpjdem Jan 25 '17 at 11:13

2 Answers


It seems from your description and errors that you are having several problems at once. There may not be a single code fix, but rather a series of confounding problems, so I will do my best without seeing your code crash and burn myself.

You will need to be very crafty to get everything done within a loop without creating bloat.

The first thing you should do, if you are merging on dates, is check each frame (prior to merging) using str(df) to verify that the date fields are identical in both their structure AND their data type. It is not uncommon for date-times to come in as factors, and sometimes even as character data.
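
For a quick check across all frames at once, something like the following might help (a minimal sketch, assuming the imported frames are held in a list called datalist, as in the question):

# minimal sketch: report the class of the Date column in every frame
# (assumes the imported frames sit in a list called `datalist`, as in the question)
sapply(datalist, function(df) class(df$Date)[1])

# or inspect the full structure of each frame
lapply(datalist, str)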

If they are not all the same, create a function (test it alone before running it in a loop) in the merge loop that goes through each file's Date field and converts it to a uniform date-time class like POSIXct.

You may need to use if statements to establish whether they are dates, numerics, characters or factors, and then apply the right as.* conversion to each, but they absolutely MUST be identical in every way if you want the merge to work (examples: as.Date or as.POSIXct).

If any are factors, you need to nest your conversions to keep from getting weird coercion. You can use:

as.POSIXct(as.character(df$col), format = "%m/%d/%Y")
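
A fuller sketch of such a conversion helper might look like this (the name normalize_date and the "%m/%d/%Y" format are assumptions based on the sample frames in the question, not a definitive implementation):

# sketch of a conversion helper; `normalize_date` is a hypothetical name
# and "%m/%d/%Y" is assumed from the sample frames shown above
normalize_date <- function(x, fmt = "%m/%d/%Y") {
  if (inherits(x, "POSIXct")) return(x)             # already the target class
  if (inherits(x, "Date"))    return(as.POSIXct(x))
  if (is.factor(x))           x <- as.character(x)  # un-factor first to avoid coercion to level codes
  as.POSIXct(x, format = fmt)
}

# apply it to every frame before merging
datalist <- lapply(datalist, function(df) { df$Date <- normalize_date(df$Date); df })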

Once you merge your new file into your big file, use rm() (with the object name inside the parentheses of the rm statement) to get rid of the newly imported file in the environment, keeping only the merged file.

Then you need to run gc() to clear the removed file from memory. If you do not do this, it may persist in memory and affect the next file.
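
In the loop from the question, that step might look roughly like this (a sketch reusing the dataset and temp_dataset names from the second attempt above):

dataset <- merge.data.frame(dataset, temp_dataset, by = "Date")
rm(temp_dataset)  # drop the just-imported frame from the environment
gc()              # ask R to release the freed memory before reading the next file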

Something to be aware of: R data frames can exist as 2-4 copies in memory during many operations that duplicate them. If those files are big and you do not purge them with rm(df) and gc(), you will definitely have issues.

Also, when working with Excel files directly you are more than likely using a Java interface, which has its own heap and takes up memory too. And R will not tell its Java worker to purge as efficiently as anyone might like; the system will only do so when it cannot function otherwise, so you end up with an ever-shrinking amount of accessible memory for merging.

Make sure you include the rm() and gc() calls in the loop after the merge. Because there are many files, consider printing the file name of any failed merge, together with a next statement to go to the next file if the current iteration fails.

This way the loop keeps running after hitting a file that will not merge, so you can build a list of failed files and start chasing down how they differ in order to fix the issues. Sometimes it is easier to do so in Excel (if a column needs to be six digits wide with leading zeros, format it that way manually).
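
A rough sketch of such a loop, building on the second attempt from the question (the failed_files vector and the tryCatch wrapper are my additions, not a definitive implementation):

library(readxl)

file_list    <- list.files()
failed_files <- character(0)   # collect the names of files that fail to read or merge

for (file in file_list) {
  ok <- tryCatch({
    temp_dataset <- read_excel(file)
    if (!exists("dataset")) {
      dataset <- temp_dataset          # first file becomes the base dataset
    } else {
      dataset <- merge.data.frame(dataset, temp_dataset, by = "Date")
    }
    TRUE
  }, error = function(e) FALSE)

  if (!ok) {
    print(file)                                   # report the file that failed
    failed_files <- c(failed_files, file)
    if (exists("temp_dataset")) rm(temp_dataset)  # tidy up before moving on
    next
  }

  rm(temp_dataset)   # purge the just-imported frame
  gc()               # and release its memory before the next iteration
}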

AND... if you are able to use CSV instead of Excel files (if they are single worksheets, not layered workbooks), convert them and you will free up RAM. Or consider writing one script to split each workbook into its worksheets and save them as CSV files, to trim the fat out of the files AND limit your dependency on Java in processing.
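
One possible sketch of such a conversion script, using readxl::excel_sheets to enumerate the worksheets and write.csv to save each one (the output naming scheme is just an assumption):

library(readxl)

# sketch: write every sheet of every workbook out as its own CSV file
# (the "<workbook>_<sheet>.csv" naming scheme is an assumption)
filenames <- list.files(path = path, pattern = "\\.xlsx?$", full.names = TRUE)

for (f in filenames) {
  for (sheet in excel_sheets(f)) {
    out <- paste0(tools::file_path_sans_ext(basename(f)), "_", sheet, ".csv")
    write.csv(read_excel(f, sheet = sheet), out, row.names = FALSE)
  }
}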

Probably not the answer you wanted, but it is a start.

sconfluentus

The data are imported from Excel files into R, which creates a lot of NA's in the last rows. Removing the NA's with the code below saves a lot of memory and helps in merging the data frames efficiently. The code removes the NA rows from a whole list of data frames in one go.

library(readxl)

filenames = list.files(path = path, full.names = TRUE)
datalist4 = lapply(filenames, function(x) { read_excel(x, sheet = 1) })
str(datalist4)
datalist5 <- lapply(datalist4, na.omit)  # remove the NA rows from each data frame
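
The cleaned list can then be merged on the shared Date key as in the original attempt, for example (a sketch using dplyr's inner_join):

library(dplyr)

# sketch: merge the NA-free frames on the shared Date column
df6 <- Reduce(function(dtf1, dtf2) inner_join(dtf1, dtf2, by = "Date"), datalist5)
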
gaurav kumar