
This topic (Quickly reading very large tables as dataframes) investigates the same problem, but not inside a loop. I have 1000 different .txt files, each one 200 MB with 1 million rows. What is the fastest way to read them in a loop?

I have tried the approaches below, with the computational times reported for a case of 10 files.

files <- list.files(pattern = "\\.txt$")

for (i in 1:10) {
  x <- read.delim(files[i])
  # do something
}
# Time: 89 sec

for (i in 1:10) {
  x <- read.table(files[i])
  # do something
}
# Time: 90 sec

library(data.table)

for (i in 1:10) {
  x <- fread(files[i])
  # do something
}
# Time: 108 sec !!! (to my knowledge it is supposed to be the fastest, but in a loop it is not the fastest)

library(foreach)
library(doParallel)
registerDoParallel(cores = 4)

foreach(i = 1:10) %dopar% {
  x <- read.delim(files[i])
  # do something
}

# Time: 83 sec

foreach(i = 1:10, .packages = "data.table") %dopar% {
  x <- fread(files[i])
  # do something
}

# Time: 95 sec

I was told that the disk.frame package is the fastest. I could not try that yet. I need your thoughts, please. Can lapply be applied to speed up the process?

Sean
  • I think the bottleneck is the speed of reading your large files, not the `for` loop – ThomasIsCoding Jan 15 '20 at 08:19
  • You probably have only one drive, so parallelizing won't help much. Based on the times, the bottleneck is the IO. You can compress the files with some lightweight compressor to make the IO shorter. – liborm Jan 15 '20 at 08:50
  • By the way - how much memory does your system have? Reading in 200 GB at once does not seem like a scalable idea... – liborm Jan 15 '20 at 08:50
  • @liborm, I read them one by one, do the processing, and save them back. I have 16 GB RAM and an Intel i7 quad-core CPU. I didn't get what you mean by the compressor method and IO. – Sean Jan 15 '20 at 09:02

2 Answers


Maybe lapply() could help, as you suggested

library(data.table)

myFiles <- list.files(pattern = "txt$")
myList <- lapply(myFiles, fread)

I am also surprised that fread takes longer than read.table for you. When I had large files, fread really helped to read them in faster.
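
If keeping all files in memory at once is a concern, here is a rough sketch that reads, processes and writes one file at a time inside the lapply; do_something() and the ".out" suffix are just placeholders, and showProgress = FALSE turns off fread's progress display:

library(data.table)

process_one <- function(f) {
  x <- fread(f, showProgress = FALSE)  # read one file, no progress display
  x <- do_something(x)                 # placeholder processing step
  fwrite(x, paste0(f, ".out"))         # write the result back to disk
  invisible(NULL)                      # return nothing so the table can be freed
}

invisible(lapply(myFiles, process_one))
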

NicolasH2
  • It's my understanding that fread works better for reading very large single files. In a loop it's slower because of that graphical loading display. – Sean Jan 15 '20 at 09:05
  • I read the files one by one, #do something, and then save them again. – Sean Jan 15 '20 at 09:07
  • #do something can be done in a similar way. If you are familiar with dplyr pipes, just do: `lapply(myFiles, fread) %>% lapply(do_something) %>% {mapply(function(x, y) fwrite(x, paste0(y, "_changed.txt")), x = ., y = myFiles, SIMPLIFY = FALSE)}` – NicolasH2 Jan 15 '20 at 12:37

I'm adding this as an answer to get some more space than in the comments.

Working fast with 'big data'

200 GB of text files is reasonably big data, which requires significant effort to speed up the processing, or a significant wait time. There's no easy way around it ;)

  1. you need to get your data into memory to start any work
    • it is fastest to read your files one by one (NOT in parallel) when reading from a single hard drive
    • measure how much time it takes to load the data without parsing
    • your load time for multiple similar files will just be a multiple of the single-file time; you can't get any magic improvements here
    • to improve the load time you can compress the input files - it pays off only if you'll be using the same data source multiple times (after compression, fewer bytes have to cross the hard drive -> memory boundary, which is slow)
    • when choosing how to compress the data, aim for load(compressed) + decompress time to be smaller than load(uncompressed)
  2. you need to parse the raw data
    • measure how much time it takes to parse the data
    • if you cannot separate the parsing, measure how much time it takes to load and parse the data; the parse time is then the difference from the previously measured load time
    • parsing can be parallelized, but it makes sense only if it is a substantial part of the load time
  3. you need to do your thing
    • this usually can be done in parallel (see the sketch after this list)
  4. you need to save the results
    • unless the results are as huge as the input, you don't care
    • if they're huge, you need to serialize your IO again, that is, save them one by one, not in parallel
    • again, compression helps if you choose an algorithm and settings where compression time + write time is smaller than the write time of the uncompressed data
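
Here is a minimal R sketch of that workflow for a batch that fits in memory (like your 10-file test). The file pattern, the core count and do_something() are placeholders, and mclapply() forks, so it assumes a Unix-like system:

library(data.table)
library(parallel)

files <- list.files(pattern = "\\.txt$")  # assumed input files

# 1. load: read the files one by one - a single drive serves sequential reads best
tables <- lapply(files, fread)

# 3. do your thing: the per-table work can run in parallel (do_something() is hypothetical)
results <- mclapply(tables, do_something, mc.cores = 4)

# 4. save: write the results back one by one, again serially
invisible(mapply(function(res, f) fwrite(res, paste0(f, ".out")), results, files))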

To get raw load times, bash is your friend. Using pipe viewer (pv) or the built-in time you can easily check the time it takes to read through a file by doing:

pv mydata.txt > /dev/null

# alternatively
time cat mydata.txt > /dev/null
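
If you'd rather do that measurement from R, a rough equivalent (mydata.txt is a placeholder name): the first call times the raw load, the second the load plus parse, and their difference approximates the parse time.

# read the raw bytes without any parsing
system.time(invisible(readBin("mydata.txt", what = "raw", n = file.size("mydata.txt"))))

# read and parse into a data frame
system.time(invisible(read.delim("mydata.txt")))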

Be aware that your disk cache will kick in when you repeatedly measure a single file.

As for compression, if you're stuck with R, gzip is the only reasonable option. If you do some pre-processing in bash, lz4 is the tool of choice, because it's really fast at decent compression ratios.

gzip -3 mydata.txt
pv mydata.txt.gz | zcat > /dev/null
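
On the R side the compressed files don't need a separate decompression step: base R's read.delim() reads through a gzfile() connection, and fread() handles .gz inputs directly (it needs the R.utils package for that, if I remember correctly):

# base R: read straight from the gzipped file
x <- read.delim(gzfile("mydata.txt.gz"))

# data.table: fread decompresses .gz on the fly (requires the R.utils package)
library(data.table)
x <- fread("mydata.txt.gz")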

Here we're getting to the pre-processing. It pays off to use UNIX tools, which tend to be really fast, to pre-process the data before loading it into R. You can filter columns with cut and filter rows with mawk (which is often much faster than gawk).
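
If you want to keep that pre-processing inside the R workflow, fread() can also read from a shell command via its cmd argument. The column numbers and the filter below are placeholders, so adjust them to your data:

library(data.table)

# keep columns 1 and 3, then keep the header plus rows whose second remaining field exceeds 10
x <- fread(cmd = "cut -f1,3 mydata.txt | mawk -F'\\t' 'NR == 1 || $2 > 10'")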

liborm