A VERY simplified example of my dataset:
HUC8 YEAR RO_MM
1: 10010001 1961 78.2
2: 10010001 1962 84.0
3: 10010001 1963 70.2
4: 10010001 1964 130.5
5: 10010001 1965 54.3
I found this code online which sort of, but not quite, does what I want:
#create a list of the files from your target directory
file_list <- list.files(path="~/Desktop/Rprojects")
#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable
allHUCS <- data.frame()
#I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.
for (i in 1:length(file_list)){
  temp_data <- fread(file_list[i], stringsAsFactors = F)
  allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T)
}
Question: I have read that one should not use rbindlist for a large dataset:
"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – @r2evans
I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?
allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)
Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two). The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:
Error in rbindlist(list(allHUCS, temp_data), use.names = T) :
  Item 2 has 2 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.
In addition: Warning messages:
1: In fread(file_list[i], stringsAsFactors = F) :
  Detected 1 column names but the data has 2 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
2: In fread(file_list[i], stringsAsFactors = F) :
  Stopped early on line 20. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>
Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.
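One way to find out which file triggers the warnings, and what is on the line where fread stopped, is to try each file separately and then look at the raw text of any suspect. This is only a rough sketch: the demo folder, the file names good.csv and notes.txt, their contents, and the helper check_file() are all made up so the example runs on its own. Since list.files() was called without a pattern, one possibility is that a non-.csv file in the folder is being read too, which is why the sketch lists those first.

```r
library(data.table)

# Hypothetical demo folder: one well-formed CSV and one stray non-CSV file.
# In practice this would be the real folder, e.g. "~/Desktop/Rprojects".
demo_dir <- file.path(tempdir(), "huc_diag_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("HUC8,YEAR,RO_MM", "10010001,1961,78.2", "10010001,1962,84.0"),
           file.path(demo_dir, "good.csv"))
writeLines(c("some notes", "# mv *.csv .. ;"),
           file.path(demo_dir, "notes.txt"))

# Full paths, so the files are found whatever the working directory is.
all_files <- list.files(demo_dir, full.names = TRUE)

# Anything that does not end in .csv is a first suspect.
not_csv <- all_files[!grepl("\\.csv$", all_files, ignore.case = TRUE)]
print(not_csv)

# Hypothetical helper: read one file and return "ok", or the text of the
# first warning or error instead of stopping the whole run.
check_file <- function(f) {
  tryCatch({
    fread(f)
    "ok"
  },
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e))
}

# One status string per file, named by file path.
status <- vapply(all_files, check_file, character(1))
print(status)

# Raw text of a suspect file; the first 25 lines cover the "line 20"
# mentioned in a warning like the one above.
print(head(readLines(not_csv[1]), 25))
```

Any file whose status is not "ok" is worth opening in a text editor to compare its header and separators with a file that reads cleanly.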
I'm now reading that rbindlist and rbind are two different things and rbindlist is faster than do.call(rbind, data). But the suggestion is do.call(rbind.data.frame, file_list). Which is going to be fastest?
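As far as I can tell, the quoted advice comes down to reading every file into a list first and then binding once, outside the loop. Note that do.call() takes the function itself plus a list of arguments, so the list has to hold the data that was read, not the file names. The data.table help page for rbindlist describes it as doing the same job as do.call("rbind", l) but much faster, so the sketch below uses rbindlist on the whole list. The demo folder and the files a.csv and b.csv are invented so the example is self-contained, and selecting columns by the names YEAR and RO_MM assumes every file has a proper header row, which the warnings above suggest may not be true for the real files.

```r
library(data.table)

# Hypothetical demo folder so the sketch runs on its own; in practice this
# would be the real folder, e.g. "~/Desktop/Rprojects".
demo_dir <- file.path(tempdir(), "huc_bind_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(HUC8 = 10010001L, YEAR = 1961:1962, RO_MM = c(78.2, 84.0)),
       file.path(demo_dir, "a.csv"))
fwrite(data.table(HUC8 = 10010002L, YEAR = 1961:1962, RO_MM = c(70.2, 130.5)),
       file.path(demo_dir, "b.csv"))

# Only .csv files, with full paths so fread() can find them
# regardless of the working directory.
file_list <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read every file into a list, keeping only the two columns of interest
# (assumes each file has a header with these names).
tables <- lapply(file_list, fread, select = c("YEAR", "RO_MM"))

# Bind once, after all files are read, instead of growing allHUCS in a loop.
# A base R version of the same idea would be do.call(rbind, tables).
allHUCS <- rbindlist(tables, use.names = TRUE)

print(dim(allHUCS))
```

Because the binding happens a single time, the data is not recopied on every pass the way it is when allHUCS grows inside the for loop.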