A VERY simplified example of my dataset:

         HUC8 YEAR RO_MM
   1: 10010001 1961  78.2
   2: 10010001 1962  84.0
   3: 10010001 1963  70.2
   4: 10010001 1964 130.5
   5: 10010001 1965  54.3

I found this code online which sort of, but not quite, does what I want:

#create a list of the files from your target directory

file_list <- list.files(path="~/Desktop/Rprojects")

#initiate a blank data frame, each iteration of the loop will append the data from the given file to this variable

allHUCS <- data.frame()

#I want to read each .csv from a folder named "Rprojects" on my desktop into one huge dataframe for further use.

for (i in 1:length(file_list)){
  temp_data <- fread(file_list[i], stringsAsFactors = F) 
  allHUCS <- rbindlist(list(allHUCS, temp_data), use.names = T) 
}

Question: I have read that one should not use rbindlist for a large dataset:

"You should never ever ever iteratively rbind within a loop: performance might be okay in the beginning, but with each call to rbind it makes a complete copy of the data, so with each pass the total data to copy increases. It scales horribly. Consider do.call(rbind.data.frame, file_list)." – @r2evans

I know this may seem simple but I'm unclear about how to use his directive. Would I write this for the last line?

allHUCS <- do.call(rbind.data.frame(allHUCS, temp_data), use.names = T)

Or something else? In my actual data, each .csv has 2099 objects with 3 variables (but I only care about the last two.) The total dataframe should contain 47,000,000+ objects of 2 variables. When I ran the original code I got these errors:

Error in rbindlist(list(allHUCS, temp_data), use.names = T) : Item 2 has 2 columns, inconsistent with item 1 which has 3 columns. To fill missing columns use fill=TRUE.

In addition: Warning messages: 1: In fread(file_list[i], stringsAsFactors = F) : Detected 1 column names but the data has 2 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.

2: In fread(file_list[i], stringsAsFactors = F) : Stopped early on line 20. Expected 2 fields but found 3. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<# mv *.csv .. ; >>

Except for the setnames() suggestion, I don't understand what I'm being told. I know it says it stopped early, but I don't even know how to see the entire dataset or to tell where it stopped.

I'm now reading that rbindlist and rbind are two different things and rbindlist is faster than do.call(rbind, data). But the suggestion is do.call(rbind.data.frame(allHUCS, temp_data). Which is going to be fastest?

  • You need `do.call(rbind, lapply(file_list, fread))` – Onyambu May 26 '21 at 00:39
  • Or alternatively, if you are already using `data.table` functions, just do `rbindlist(lapply(file_list, fread))`. The whole point of the comments you quoted is that growing objects in R can be inefficient, so it is best to do whatever operation on the whole object at once instead of in a `for` loop. The errors/warnings you are getting suggest that the files you are reading may not all have the same headers. To be safe, just read everything into a list first and worry about binding them together later: `df_list <- lapply(file_list, fread); lapply(df_list, colnames) #inspect output` – Justin Landis May 26 '21 at 00:51
  • @JustinLandis A random sampling of them shows them to be the same: three columns named HUC8, YEAR, RO_MM. All 344,000+ csv files were created in the same operation. Is it still possible that they could have different headers? Also, could you please explain where `rbindlist(lapply(file_list, fread))` would fit into the code I have, like, instead of which line, or is it in lieu of the whole for loop? I really am new at this. – David Montana May 26 '21 at 02:24
  • This would replace the whole loop. The `lapply` function will apply the second argument (a function) to each element of the first argument, and return the results in a list. Then `rbindlist` will merge them into a single `data.frame`. As for the warnings, you might not know what the issue is until you read everything in. Just do it in two steps. Good luck – Justin Landis May 26 '21 at 13:19
  • A good discussion about lists-of-frames is here: https://stackoverflow.com/a/24376207/3358227. While that discussion often strays into keeping the frames as unique elements in the list, it does touch on how to combine them. – r2evans May 27 '21 at 01:22
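
The two-step approach suggested in the comments above can be sketched roughly as follows. This is only an illustration under stated assumptions: the temporary folder, the three small CSV files, and the stray `notes.txt` file are all made up for the demo and stand in for the real `~/Desktop/Rprojects` folder. Note that `list.files()` returns bare file names unless `full.names = TRUE` is given, and that without a `pattern` it also returns non-CSV files, which may be one source of the `fread` warnings in the question.

```r
library(data.table)

# Hypothetical stand-in for the real folder: three small CSVs in a temp directory
csv_dir <- file.path(tempdir(), "huc_demo")
dir.create(csv_dir, showWarnings = FALSE)
for (huc in c(10010001L, 10010002L, 10010003L)) {
  fwrite(
    data.table(HUC8 = huc, YEAR = 1961:1963, RO_MM = c(78.2, 84.0, 70.2)),
    file.path(csv_dir, paste0(huc, ".csv"))
  )
}
# A stray non-CSV file (made up), to show why filtering by extension matters
writeLines("# mv *.csv .. ;", file.path(csv_dir, "notes.txt"))

# Step 1: list only the .csv files, with full paths so fread() can find them
file_list <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)

# Step 2: read every file into a list, without binding anything yet
df_list <- lapply(file_list, fread)

# Step 3: inspect the headers; unique() collapses identical sets of column names
header_sets <- unique(lapply(df_list, colnames))
print(header_sets)

# Step 4: bind once, outside any loop
allHUCS <- rbindlist(df_list, use.names = TRUE)
```

If `header_sets` has more than one element, the files do not all share the same columns, and the offending files can be found by comparing `lapply(df_list, colnames)` against the expected names before binding.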

1 Answer

Since the original post does not include a reproducible example, here is one that reads data from the Pokémon Stats data that I maintain on GitHub.

First, we download a zip file containing one CSV file for each generation of Pokémon, and unzip it to the ./pokemonData subdirectory of the R working directory.

download.file("https://raw.githubusercontent.com/lgreski/pokemonData/master/PokemonData.zip",
          "pokemonData.zip",
          method="curl",mode="wb")

unzip("pokemonData.zip",exdir="./pokemonData")

Next, we obtain a list of files in the directory to which we unzipped the CSV files.

thePokemonFiles <- list.files("./pokemonData",
                              full.names=TRUE) 

Finally, we load the data.table package, use lapply() with data.table::fread() to read the files, combine the resulting list of data tables with do.call() and rbind(), and print the head() and tail() of the resulting data table with all 8 generations of Pokémon stats.

library(data.table)

data <- do.call(rbind,lapply(thePokemonFiles,fread))

head(data)
tail(data)

...and the output:

> head(data)
   ID       Name Form Type1  Type2 Total HP Attack Defense Sp. Atk Sp. Def Speed
1:  1  Bulbasaur      Grass Poison   318 45     49      49      65      65    45
2:  2    Ivysaur      Grass Poison   405 60     62      63      80      80    60
3:  3   Venusaur      Grass Poison   525 80     82      83     100     100    80
4:  4 Charmander       Fire          309 39     52      43      60      50    65
5:  5 Charmeleon       Fire          405 58     64      58      80      65    80
6:  6  Charizard       Fire Flying   534 78     84      78     109      85   100
   Generation
1:          1
2:          1
3:          1
4:          1
5:          1
6:          1
> tail(data)
    ID      Name         Form   Type1 Type2 Total  HP Attack Defense Sp. Atk
1: 895 Regidrago               Dragon         580 200    100      50     100
2: 896 Glastrier                  Ice         580 100    145     130      65
3: 897 Spectrier                Ghost         580 100     65      60     145
4: 898   Calyrex              Psychic Grass   500 100     80      80      80
5: 898   Calyrex    Ice Rider Psychic   Ice   680 100    165     150      85
6: 898   Calyrex Shadow Rider Psychic Ghost   680 100     85      80     165
   Sp. Def Speed Generation
1:      50    80          8
2:     110    30          8
3:      80   130          8
4:      80    80          8
5:     130    50          8
6:     100   150          8
> 
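
Since the question mentions data.table's `rbindlist()` and says only the last two columns matter, here is a hedged variant of the same pattern. It is a sketch, not a benchmark: the two demo files and the `source_file` column name are invented for illustration. It uses `fread()`'s `select` argument to read only the named columns and `rbindlist()`'s `idcol` argument to record which file each row came from.

```r
library(data.table)

# Hypothetical demo files standing in for the asker's CSVs
demo_dir <- file.path(tempdir(), "select_demo")
dir.create(demo_dir, showWarnings = FALSE)
fwrite(data.table(HUC8 = 10010001L, YEAR = 1961:1962, RO_MM = c(78.2, 84.0)),
       file.path(demo_dir, "a.csv"))
fwrite(data.table(HUC8 = 10010002L, YEAR = 1961:1962, RO_MM = c(70.2, 130.5)),
       file.path(demo_dir, "b.csv"))

demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Name the vector so lapply() returns a named list, which idcol can use
names(demo_files) <- basename(demo_files)

# select= reads only the listed columns from each file;
# the single rbindlist() call binds everything at once
two_cols <- rbindlist(
  lapply(demo_files, fread, select = c("YEAR", "RO_MM")),
  use.names = TRUE,
  idcol = "source_file"
)

print(two_cols)
```

Dropping the unneeded column at read time keeps the intermediate list smaller, which may matter with hundreds of thousands of files.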
Len Greski