3

I used the following function to merge all .csv files in my directory into one dataframe:

multmerge = function(mypath){
filenames = list.files(path = mypath, full.names = TRUE)
rbindlist(lapply(filenames,fread),fill = TRUE) }

dataframe = multmerge(path)

This code produces this error:

Error in rbindlist(lapply(filenames, fread), fill = TRUE) : Internal error: column 25 of result is determined to be integer64 but maxType=='character' != REALSXP

The code has worked on the same csv files before...I'm not sure what's changed and what the error message means.

A. Suliman
  • 12,923
  • 5
  • 24
  • 37
Audrey Liu
  • 35
  • 1
  • 4
  • 1
    Without seeing the files in your directory I cannot test this theory for sure but from the error message it appears that one of your .csv files has encoded column 25 as a character vector rather than numeric. Before you use rbindlist use an lappy or even sapply to get the class of each column of each `data.table`. If this is the case I could help you add a line of code in your function to check for this and coerce the problematic datafame column into the appropriate class before you rbind the list. Hope that helps! – Jason Johnson Apr 28 '19 at 23:19
  • 1
    Hi Jason! Thank you so much for your reply. It does sound possible that there is a problematic data frame column, but there are really too many files in the directory to check one by one. It would be great if you could help me add the code to check for this and coerce the column into the appropriate class! I've looked up solutions in several other places, but I'm not sure how to fit it into a function like mine which takes in all the files in the directory..Again, thank you very much! – Audrey Liu Apr 29 '19 at 07:20
  • 1
    @AudreyLiu check this https://stackoverflow.com/questions/44808489/error-handling-in-a-loop-that-outputs-a-dataframe and https://stackoverflow.com/questions/49834055/skip-errors-in-r-for-loops-and-also-pause-the-process-in-each-iteration – A. Suliman Apr 29 '19 at 08:27
  • 1
    @AudreyLiu Yes you could do this in two ways depending on how many columns your final `data.table` will contain. How many total columns will there be after the rbind? If there are very few the easiest and probably safest way would be to coerce all columns to character and then after rbindlist you could simply specify which columns you want to be numeric. Also how many columns are intended to be character vectors (i.e., categorical data or non-numeric data)? If there are only a few then it will also be easy to keep those as character vectors. – Jason Johnson Apr 29 '19 at 16:00

1 Answers1

4

So in looking at the documentation of fread I just noticed there is an integer64 option so are you dealing with integers greater than 2^31?

EDIT: I added the tryCatch which will print a formatted message to the console indicating which files are causing the error with the actual error message. However for rbindlist to then execute over the normal files you need to create a dummy list that will produce an extra column called ERROR which will have NAs in all rows except the bottom one(s) which will have the name of the problem file as its value(s).

I suggest after you run this code through once, delete the ERROR column and extra row(s) from the data.table and then save this combined file as a .csv. I would then move all the files that combined properly into a different folder and only have the current combined file and the ones that didn't load properly in the path. Then rerun the function without the colClasses specified. I combined everything into one script so it's hopefully less confusing:

#First Initial run without colClasses

  multmerge = function(mypath){
        filenames = list.files(path = mypath, full.names = TRUE)
        rbindlist(lapply(filenames,function(i) tryCatch(fread(i),
                                                        error = function(e) {
                                                                 cat("\nError reading in file:",i,"\t") #Identifies problem files by name
                                                                 message(e) #Prints error message without stopping loop
                                                                 list(ERROR=i) #Adds a placeholder column so rbindlist will execute
                                                                 })), #End of tryCatch and lapply 
                   fill = TRUE) #rbindlist arguments
    } #End of function

 #You should get the original error message and identify the filename.
  dataframe = multmerge(path)
 #Delete placeholder column and extra rows 
 #You will get as many extra rows as you have problem files - 
 #most likely just the one with column 25 or any others that had that same issue with column 25. 
 #Note the out of bounds error message will probably go away with the colClasses argument pulled out.)

 #Save this cleaned file to something like: fwrite(dataframe,"CurrentCombinedData.csv")
 #Move all files but problem file into new folder
 #Now you should only have the big one and only one in your path.
 #Rerun the function but add the colClasses argument this time

#Second run to accommodate the problem file(s) - We know the column 25 error this time but maybe in the future you will have to adapt this by adding the appropriate column.

  multmerge = function(mypath){
        filenames = list.files(path = mypath, full.names = TRUE)
        rbindlist(lapply(filenames,function(i) tryCatch(fread(i,colClasses = list(character = c(25))),
                                                        error = function(e) {
                                                                 cat("\nError reading in file:",i,"\t") #Identifies problem files by name
                                                                 message(e) #Prints error message without stopping loop
                                                                 list(ERROR=i) #Adds a placeholder column so rbindlist will execute
                                                                 })), #End of tryCatch and lapply
                   fill = TRUE) #rbindlist arguments
    } #End of function

   dataframe2 = multmerge(path)

Now we know the source of the error is column 25 which we can specify in colClasses. If you run the code and you get the same error message for a different column simply add the number of that column after the 25. Once you have the dataframe inputted I would check what is going on in that column (or any others if you must add other columns). Maybe there was a data entry error in one of the files or different encoding of an NA value. That's why I say to initially convert that column to character first because you will lose less information than converting to numeric first.

Once you have no errors always write the cleaned combined data.table to a csv that is contained in your folder and always move the individual files that have been combined into the other folder. That way when you add new files you will only be combining the big one and a few others so that in the future you can see what is going on easier. Just keep notes as to which files gave you trouble and which columns. Does that make sense?

Because files are often so idiosyncratic you will have to be flexible but this approach to the workflow should make it easy to identify problem files and add what you need to add to the fread to make it work. Basically archive the files that have been processed and keep track of exceptions like the column 25 one and keep the most current combined file and ones that haven't been processed together in the active path. Hope that helps and good luck!

Jason Johnson
  • 451
  • 3
  • 7
  • Hi Jason, I ran your code and successfully bound my csv files! Thank you so much for your help. Large numbers must have been the issue since I am recording time within experimental blocks in milliseconds, and I think that's how the large numbers came about! – Audrey Liu Apr 30 '19 at 10:31
  • Hi Jason! Thank you again for your answer! I encountered a different error though today when I tried to run the code again: even though the same exact code worked the day before. This time the error reads: "Error in fread(i, colClasses = list(character = c(25))) : Column number 25 (colClasses[[1]][1]) is out of range [1,ncol=1]." Do you have any idea why the function seems to only run once? Is the function somehow changing the underlying csv files every time it is run? – Audrey Liu May 01 '19 at 12:39
  • Did you add another csv file to your folder that maybe has less than 25 columns? If so then my suggestion would be to wrap the fread in a tryCatch statement. I will edit my answer to include that for you though one thing you might want to consider if you are constantly adding files is to save the bound csv data.table to a csv if successful and then move the old files that you’ve bound to a different folder. Then you can read in the big csv and any new ones which will make it easier for you to see what is going on. Does that make sense? – Jason Johnson May 01 '19 at 14:02
  • Hi Jason, I'm really sorry for the slow response and thank you for your patience! (I'm currently traveling and the internet connection is on and off.) The error handling workflow you suggested sounds like a great idea and it makes a lot of sense. I tried your edited function but unfortunately ran into another error while running the first multmerge(path) for identifying errors. The error message is as such: "Error in FUN(X[[i]], ...) : unused argument (error = function(e) { cat("\nError reading in file:", i, "\t") message(e) list(ERROR = i) })." – Audrey Liu May 06 '19 at 14:59
  • I looked for some documentation for trycatch and it doesn't look like it is implemented wrongly. Could it be that I am missing some package or another? Thank you very much for your time! – Audrey Liu May 06 '19 at 14:59
  • Did you copy and paste the function exactly or did you put everything on one line? If you move everything on one line you need to add a semi-colon between all the arguments in `function(e)`. I think that may be the source of the error. Everything else is part of base R. – Jason Johnson May 06 '19 at 17:56
  • I'd copied the function exactly as is. I also tried adding semicolons between all the arguments in function(e) (i.e. after each line), but I'm running into the same error. – Audrey Liu May 07 '19 at 05:02
  • I think I found the mistake and updated the code. The parentheses were not quite right but try it now and let me know if that helps. – Jason Johnson May 07 '19 at 06:23
  • Yes - that was the issue! It is working now. Thanks so much for your help! – Audrey Liu May 13 '19 at 15:31