8

I cannot load the file into RAM (assume a user might want the first billion lines of a file with ten billion records)

Here is my solution, but I think there has got to be a faster way?

thanks

# specified by the user
infile <- "/some/big/file.txt"
outfile <- "/some/smaller/file.txt"
num_lines <- 1000


# my attempt
incon <- file( infile , "r") 
outcon <- file( outfile , "w") 

for ( i in seq( num_lines ) ){

    line <- readLines( incon , 1 )

    writeLines( line , outcon )

}

close( incon )
close( outcon )
Mike Williamson
Anthony Damico

8 Answers

7

You can use ff::read.table.ffdf for this. It reads the file in chunks and stores the data on the hard disk, so it does not need to hold the whole file in RAM.

library(ff)
infile <- read.table.ffdf(file = "/some/big/file.txt")

Essentially you can use the above function in the same way as base::read.table with the difference that the resulting object will be stored on the hard disk.

You can also use the nrows argument to load a specific number of rows; the documentation is worth a read if you want more detail. Once you have read the file, you can subset the specific rows you need and even convert them to data.frames if they fit in RAM.

There is also a write.table.ffdf function that will allow you to write an ffdf object (resulting from read.table.ffdf) which will make the process even easier.


As an example of how to use read.table.ffdf (or read.delim.ffdf which is pretty much the same thing) see the following:

#writing a file in my current directory
#note that there is no standard number of columns
sink(file='test.txt')
cat('foo , foo, foo\n')
cat('foo, foo\n')
cat('bar bar , bar\n')
sink()

#read it with read.delim.ffdf or read.table.ffdf
read.delim.ffdf(file='test.txt', sep='\n', header=F)

Output:

ffdf (all open) dim=c(3,1), dimorder=c(1,2) row.names=NULL
ffdf virtual mapping
   PhysicalName VirtualVmode PhysicalVmode  AsIs VirtualIsMatrix PhysicalIsMatrix PhysicalElementNo PhysicalFirstCol PhysicalLastCol PhysicalIsOpen
V1           V1      integer       integer FALSE           FALSE            FALSE                 1                1               1           TRUE
ffdf data
              V1
1 foo , foo, foo
2 foo, foo      
3 bar bar , bar 

If you are using a txt file then this is a general solution as each line will finish with a \n character.
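
Putting this together for the original question (copy only the first num_lines lines), a minimal sketch could look as follows; the paths are placeholders, and the extra arguments just keep read.table from treating quotes or comment characters inside the lines specially:

library(ff)

num_lines <- 1000

# one row per line: "\n" as the separator, everything in a single factor column
first_rows <- read.table.ffdf(file = "/some/big/file.txt", nrows = num_lines,
                              header = FALSE, sep = "\n", colClasses = "factor",
                              quote = "", comment.char = "")

# write.table.ffdf writes the ffdf object back to disk chunk-wise
write.table.ffdf(first_rows, file = "/some/smaller/file.txt",
                 row.names = FALSE, col.names = FALSE, quote = FALSE)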

LyzandeR
  • thanks, will this work on any file? what if the file is not a table? thanks – Anthony Damico Nov 17 '15 at 14:51
  • @AnthonyDamico You are welcome. Yes it will work on any file in the exact same way `read.table` would work. If there is no delimiter (I assume that is what you mean when you say not a table) then the resulting object will have one column per line read. – LyzandeR Nov 17 '15 at 14:53
  • 1
    great, thanks. adding a bounty to get more attention and see if others have ideas, but i like this answer. – Anthony Damico Nov 21 '15 at 15:47
  • hi, this answer actually does not work as written because `read.table` does not automatically store the file as a single column. consider `tf <- tempfile() ; writeLines( c( "hi" , "hello hello" ) , tf ) ; read.table( tf ) ; read.table( tf , header = FALSE )` – Anthony Damico Nov 22 '15 at 11:39
  • Use an arbitrary separator and it will. Convert the last function to `read.table( tf , header = FALSE, sep='@')`. All will be stored as one column. This is a typical way of using `read.table`. – LyzandeR Nov 22 '15 at 11:44
  • after playing with this, this option is a lot buggier than i had hoped. tried to implement here, lots of exceptions to deal with https://github.com/ajdamico/asdfree/blob/5e82418e9fc9a0730fc5116d4f59852367494728/MonetDB/read.SAScii.monetdb.R#L138-L146 – Anthony Damico Nov 22 '15 at 12:08
  • yeah but then it breaks if the file has that arbitrary separator. makes it a one-off solution, not a general one :( – Anthony Damico Nov 22 '15 at 12:10
  • There is a general solution if you want each line to be in one column if you have a txt file and I find it to work every time. Just use the break line delimiter. I ll provide an example as well. – LyzandeR Nov 22 '15 at 14:14
  • `sep='\n'` is a good idea. note there are still plenty of problem points, many of which can be worked around with `infile <- read.table.ffdf( file = fn , nrows = n_max , header = FALSE , sep = "\n" , colClasses = "factor" , row.names = NULL , quote = '' , na.strings = NULL , comment.char = "" )` thanks again – Anthony Damico Nov 22 '15 at 15:49
  • Just to clarify: doesn't this imply that the entire input file has to get copied into the format required by the [ff](http://cran.r-project.org/package=ff) package? That would have cost in _space_ (extra file) and _time_ (copying). – Dirk Eddelbuettel Nov 22 '15 at 18:12
  • @DirkEddelbuettel The file is read in chunks by `read.table` internally. Does `read.table` create a new (temporary) copy of the file that I do not know of (I haven't read the source code of the function)? `read.table.ffdf` is converting that chunk into an `ffdf` object in order to be saved on the hard disk. – LyzandeR Nov 22 '15 at 19:28
  • First paragraph of _Details:_ in `help(read.table.ffdf)`: _‘read.table.ffdf’ has been designed to read very large (many rows) separated flatfiles in row-chunks and **store the result in a ‘ffdf’ object on disk**, but quickly accessible via ‘ff’ techniques._ – Dirk Eddelbuettel Nov 22 '15 at 19:33
  • Also, `ff` doesn't handle character vectors well; these are converted to factors of which the levels are stored in memory. – Jan van der Laan Nov 25 '15 at 10:42
  • @LyzandeR Unless you are reading a text file with mostly unique lines (as in this case probably), or when the character columns are unique identifiers. Not that I don't like/use `ff`, but it is a problem, especially in this case. – Jan van der Laan Nov 25 '15 at 11:08
  • @JanvanderLaan Yeah that's what I thought and deleted my previous comment. I looked for any information regarding this, but I couldn't find anything that explicitly says how the information is stored. I found a bit that says that `ramclass` objects are not really in RAM for the whole time, but they get parsed in RAM when needed in chunks. It would be weird for the authors of the package to have all the levels in RAM because big files with lots of character columns it would still fail. I didn't find any issues on github either (about any complaints that reading a file failed). – LyzandeR Nov 25 '15 at 11:30
6

I like pipes for that as we can use other tools. And conveniently, the (truly excellent) connections interface in R supports it:

## scratch file
filename <- "foo.txt"               

## create a file, no header or rownames for simplicity
write.table(1:50, file=filename, col.names=FALSE, row.names=FALSE)   

## sed command:  print from first address to second, here 4 to 7
##               the -n suppresses output unless selected
cmd <- paste0("sed -n -e '4,7p' ", filename)
##print(cmd)                        # to debug if needed

## we use the cmd inside pipe() as if it was file access so
## all other options to read.csv (or read.table) are available too
val <- read.csv(pipe(cmd), header=FALSE, col.names="selectedRows")
print(val, row.names=FALSE)

## clean up
unlink(filename)

If we run this, we get rows four to seven as expected:

edd@max:/tmp$ r piper.R 
 selectedRows
            4
            5
            6
            7
edd@max:/tmp$ 

Note that our use of sed made no assumptions about the file structure besides assuming

  • standard "ascii" text file to be read in text mode
  • standard CR/LF line endings as 'record separators'

If you had binary files with different record separators, we could suggest different solutions.

Also note that you control the command passed to the pipe() function. So if you want rows 1000004 to 1000007, the usage is exactly the same: you just give the first and last row of each segment (there can be several). And instead of read.csv(), readLines() could be used equally well.
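
For instance, a quick sketch of that variation (the paths and row numbers are placeholders), pulling an arbitrary segment with readLines() and writing it back out:

infile  <- "/some/big/file.txt"            # placeholder paths
outfile <- "/some/smaller/file.txt"
i <- 1000004L                              # first and last row of the segment
j <- 1000007L

cmd  <- sprintf("sed -n -e '%d,%dp' %s", i, j, infile)
rows <- readLines(pipe(cmd))               # only the selected rows ever reach R
writeLines(rows, outfile)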

Lastly, sed is available everywhere and, if memory serves, also part of Rtools. The basic filtering functionality can also be obtained with Perl or a number of other tools.

Dirk Eddelbuettel
  • this looks great! if a new user has a fresh install of base R on windows/mac/unix without any rtools/rstudio anything, are they going to have `sed` working? not sure why it'd be a part of rtools if it comes pre-packaged with every os already? the other answer can be implemented across platforms as easily as `install.packages("ff")` -- is that true with `sed` as well? thanks – Anthony Damico Nov 21 '15 at 19:17
  • `sed` is a binary you need in the path, see eg [here for Windows](http://gnuwin32.sourceforge.net/packages/sed.htm), or [here from SuperUser](http://superuser.com/questions/390241/sed-for-windows). OS X and Linux will have it. If _you_ want to / need to provide a solution you could ship a `sed` binary for windows in a package and refer to it in the `cmd` variable. – Dirk Eddelbuettel Nov 21 '15 at 19:33
6

C++ solution

It is not too difficult to write some C++ code for this:

#include <fstream>
#include <R.h>
#include <Rdefines.h>

extern "C" {

  // [[Rcpp::export]]
  SEXP dump_n_lines(SEXP rin, SEXP rout, SEXP rn) {
    // no checks on types and size
    std::ifstream strin(CHAR(STRING_ELT(rin, 0)));
    std::ofstream strout(CHAR(STRING_ELT(rout, 0)));
    int N = INTEGER(rn)[0];

    int n = 0;
    while (strin && n < N) {
      int c = strin.get();
      if (!strin) break;  // avoid writing a spurious byte at end of file
      if (c == '\n') ++n;
      strout.put(c);
    }

    strin.close();
    strout.close();
    return R_NilValue;
  }
}

When saved as yourfile.cpp, you can do

Rcpp::sourceCpp('yourfile.cpp')

From RStudio you don't have to load anything; from the console you will have to load Rcpp first. On Windows you will probably have to install Rtools.

More efficient R-code

By reading larger blocks instead of single lines, your code will also speed up:

dump_n_lines2 <- function(infile, outfile, num_lines, block_size = 1E6) {
  incon <- file( infile , "r") 
  outcon <- file( outfile , "w") 

  remain <- num_lines

  while (remain > 0) {
    size <- min(remain, block_size)
    lines <- readLines(incon , n = size)
    writeLines(lines , outcon)
    # check for eof:
    if (length(lines) < size) break 
    remain <- remain - size
  }
  close( incon )
  close( outcon )
}

Benchmark

lines <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean commodo
imperdiet nunc, vel ultricies felis tincidunt sit amet. Aliquam id nulla eu mi
luctus vestibulum ac at leo. Integer ultrices, mi sit amet laoreet dignissim,
orci ligula laoreet diam, id elementum lorem enim in metus. Quisque orci neque,
vulputate ultrices ornare ac, interdum nec nunc. Suspendisse iaculis varius
dapibus. Donec eget placerat est, ac iaculis ipsum. Pellentesque rhoncus
maximus ipsum in hendrerit. Donec finibus posuere libero, vitae semper neque
faucibus at. Proin sagittis lacus ut augue sagittis pulvinar. Nulla fermentum
interdum orci, sed imperdiet nibh. Aliquam tincidunt turpis sit amet elementum
porttitor. Aliquam lectus dui, dapibus ut consectetur id, mollis quis magna.
Donec dapibus ac magna id bibendum."
lines <- rep(lines, 1E6)
writeLines(lines, con = "big.txt")

infile <- "big.txt"
outfile <- "small.txt"
num_lines <- 1E6L


library(microbenchmark)
microbenchmark(
  solution0(infile, outfile, num_lines),
  dump_n_lines2(infile, outfile, num_lines),
  dump_n_lines(infile, outfile, num_lines)
  )

Results in (solution0 is the OP's original solution):

Unit: seconds
                                     expr       min        lq      mean    median        uq       max neval cld
    solution0(infile, outfile, num_lines) 11.523184 12.394079 12.635808 12.600581 12.904857 13.792251   100   c
dump_n_lines2(infile, outfile, num_lines)  6.745558  7.666935  7.926873  7.849393  8.297805  9.178277   100  b 
 dump_n_lines(infile, outfile, num_lines)  1.852281  2.411066  2.776543  2.844098  2.965970  4.081520   100 a 

The C++ solution can probably be sped up by reading in large blocks of data at a time. However, this will make the code much more complex. Unless this is something I would have to do on a very regular basis, I would probably stick with the pure R solution.

Remark: when your data is tabular, you can use my LaF package to read arbitrary lines and columns from your data set without having to read all of the data into memory.
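
A rough sketch of that LaF approach, for completeness; the file name and column types below are made up for illustration, so check ?laf_open_csv for the exact arguments:

library(LaF)

# open the file without reading it; the column types are an assumption for this example
laf <- laf_open_csv("/some/big/file.csv",
                    column_types = c("integer", "string", "double"))

# pull an arbitrary slice of rows straight from disk into a data.frame
first_rows <- laf[1:1000, ]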

Jan van der Laan
  • That's not very idiomatic Rcpp code. Why `SEXP` in the interface? It's not 2009 anymore ;-) Also, this answer has heavier installation requirements than the answers by @jack-wasey or myself which just need a binary on Windows -- not an entire compiler toolchain. Not that I am against Rcpp, but apples-to-apples criticism may be appropriate. – Dirk Eddelbuettel Nov 28 '15 at 16:31
  • @DirkEddelbuettel you are right on both accounts. Lately, I have been writing mostly pre-2009 C++ code and my Rcpp knowledge is a bit rusty. I don't claim to be writing Rcpp code in my answer. I only use Rcpp to source the files in RStudio, which is easier than using `R CMD SHLIB`. I believe that the `readLines` solution using blocks is the fastest pure R solution. You could place the C++ code in a small package and put that in a drat repo. A user wouldn't need Rtools then. – Jan van der Laan Nov 28 '15 at 19:32
  • It was more a comment than a criticism, and a poorly stated inquiry. Should we brush it up and use C++ streams, or at least Rcpp wrapping to reduce the length in half? – Dirk Eddelbuettel Nov 28 '15 at 19:34
  • @DirkEddelbuettel Feel free to edit the answer. If not I will give it a brush up in a few days when I have more time. – Jan van der Laan Nov 28 '15 at 19:50
3

I usually speed up such loops by reading and writing in chunks of, say, 1000 lines. If num_lines is a multiple of 1000, the code becomes:

# specified by the user
infile <- "/some/big/file.txt"
outfile <- "/some/smaller/file.txt"
num_lines <- 1000000


# my attempt
incon <- file( infile, "r") 
outcon <- file( outfile, "w") 

step1 = 1000
nsteps = ceiling(num_lines/step1)

for ( i in 1:nsteps ){
    line <- readLines( incon, step1 )
    writeLines( line, outcon )  
}

close( incon )
close( outcon )
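
If num_lines is not a multiple of step1, the last chunk overshoots; one way to handle the general case (a sketch reusing the variables defined above, not necessarily the same as the author's more complicated version) is to cap each read at the number of lines still needed:

incon  <- file( infile, "r")
outcon <- file( outfile, "w")

remaining <- num_lines
while (remaining > 0) {
    line <- readLines( incon, min(step1, remaining) )
    if (length(line) == 0) break   # stop early at end of file
    writeLines( line, outcon )
    remaining <- remaining - length(line)
}

close( incon )
close( outcon )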
Andrey Shabalin
  • good point! two notes: (1) `while( length ( line <- readLines( incon , 1000 ) ) > 0 ) {` is nicer sometimes and (2) your code will be inexact if `num_lines` is not a multiple of `step1` and so will my `while()` statement – Anthony Damico Nov 22 '15 at 12:28
  • I have a slightly more complicated version for the general case. That's why I used `ceiling` function for `nsteps`. – Andrey Shabalin Nov 22 '15 at 19:15
  • Also, your code that is 'nicer sometimes' reads the whole file, not only `num_lines` lines. – Andrey Shabalin Nov 22 '15 at 19:19
3

The operating system is the best place to do big file manipulations. This is quick, and comes with a benchmark (which seems important, given the poster asked about a faster method):

# create test file in shell 
echo "hello
world" > file.txt
for i in {1..29}; do cat file.txt file.txt > file2.txt && mv file2.txt file.txt; done
wc -l file.txt
# about a billion rows

This takes a few seconds for a billion rows. Change 29 to 32 in order to get about ten billion.

Then in R, using ten million rows from the billion (a hundred million was way too slow to compare with the poster's solution):

# in R, copy first ten million rows of the billion
system.time(
  system("head -n 10000000 file.txt > out.txt")
)

# poster's solution
system.time({
  infile <- "file.txt"
  outfile <- "out.txt"
  num_lines <- 1e7
  incon <- file( infile , "r") 
  outcon <- file( outfile , "w") 

  for ( i in seq( num_lines )) {
    line <- readLines( incon , 1 )
    writeLines( line , outcon )
  }

  close( incon )
  close( outcon )
})

And the results on a mid-range MacBook Pro, a couple of years old:

Rscript head.R
   user  system elapsed 
  1.349   0.164   1.581 
   user  system elapsed 
620.665   3.614 628.260

Would be interested to see how fast the other solutions are.

Jack Wasey
  • will `system("head -n 10000000 file.txt > out.txt")` work on every operating system? thanks – Anthony Damico Nov 23 '15 at 22:26
  • `Rtools` on Windows does ship with MinGW which includes `head`. I'm not sure whether R also includes any UNIX-like tools at all. – Jack Wasey Nov 23 '15 at 23:26
  • I see from another stackoverflow answer that `powershell -command "& {Get-Content *filename* -TotalCount *n*}"` does `head` for Windows since XP, so this could be dynamically used if Windows is detected. https://stackoverflow.com/questions/1295068/windows-equivalent-of-the-tail-command – Jack Wasey Nov 23 '15 at 23:27
  • This answer is close in spirit to mine. Substitute `pipe()` for `system()` and you can do without the temporary files. – Dirk Eddelbuettel Nov 25 '15 at 02:03
  • @DirkEddelbuettel i really like this answer since i think `head` works cross-platform without any outside software? or do i have that wrong? – Anthony Damico Nov 25 '15 at 19:55
  • You are the judge, not me, but `head` is as foreign to the damned Windows world as is `sed`. And my approach allows any lines from _i_ to _j_ -- which you can proxy with `head` and `tail`. But this is your problem, and maybe you should do some measurements on the systems and files relevant to your question? – Dirk Eddelbuettel Nov 25 '15 at 20:44
  • I don't think head is cross platform unless rtools is installed. The commands I mention above could be used if Windows detected, as a non rtools solution. I can't update my answer for a day or so, and Dirk's is in the same vein. – Jack Wasey Nov 25 '15 at 20:49
  • ahh sorry for some reason `head` was part of my windows `path` but `sed` is not. if we drop the requirement that it work on any R user's computer without external software, then i agree `sed` is better. – Anthony Damico Nov 26 '15 at 20:49
2

The "right" or best answer for this would be to use a language that works much more easily with file handles. For instance, while Perl is an ugly language in many ways, this is where it shines. Python can also do this very well, in a more verbose fashion.


However, you have explicitly stated you want things in R. First, I'll assume that this thing might not be a CSV or other delimited flat file.

Use the readr library. Within that library, use read_lines(). Something like this (first, get the number of lines in the entire file, using something like what is shown here):

library(readr)

# specified by the user
infile <- "/some/big/file.txt"
outfile <- "/some/smaller/file.txt"
num_lines <- 1000


# readr attempt
# num_lines_tot is found via the method shown in the link above
num_loops <- ceiling(num_lines_tot / num_lines)
incon <- file( infile , "r") 
outcon <- file( outfile , "w") 

for ( i in seq(num_loops) ){

    lines <- read_lines(incon, skip= (i - 1) * num_lines,
                        n_max = num_lines)
    writeLines( lines , outcon )
}

close( incon )
close( outcon )
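
The num_lines_tot lookup is not shown above; one rough sketch of getting it without loading the file (count_lines is a hypothetical helper, and the chunk size is arbitrary) is to count lines in chunks:

count_lines <- function(path, chunk_size = 1e5) {
    con <- file(path, "r")
    on.exit(close(con))
    n <- 0
    repeat {
        chunk <- readLines(con, n = chunk_size)
        if (length(chunk) == 0) break   # end of file
        n <- n + length(chunk)
    }
    n
}

num_lines_tot <- count_lines(infile)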

A few things to note:

  1. There is no nice, convenient way to write with the readr library that is as generic as it seems you want. (There is, for instance, write_delim, but you did not specify a delimited file.)
  2. All of the information that is in the previous incarnations of the "outfile" will be lost. I am not sure if you meant to open "outfile" in append mode ("a"), but I suspect that would be helpful.
  3. I have found when working with large files like this, often I'll want to do filtering of the data, while opening it like this. Doing the simple copy seems strange. Maybe you want to do more?
  4. If you had a delimited file, you'd want to look at read_csv or read_delim within the readr package.
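
If the goal is literally just the first num_lines lines, the loop is not needed at all; a shorter sketch along the same readr lines (base writeLines() handles the output, given point 1 above; paths are placeholders):

library(readr)

infile    <- "/some/big/file.txt"
outfile   <- "/some/smaller/file.txt"
num_lines <- 1000

lines <- read_lines(infile, n_max = num_lines)   # read only the first num_lines lines
writeLines(lines, outfile)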
Mike Williamson
  • nice, but FWIW the poster asked for "within R", not "in R." – Jack Wasey Nov 23 '15 at 21:36
  • @JackWasey OK. But I'm not sure I understand the distinction. Do you mean that the OP did not want to use any libraries? Or do you mean that system calls are acceptable? If the latter, I still feel that `readr` is better, since it is highly optimized. – Mike Williamson Nov 24 '15 at 06:03
  • I suppose I was thinking that "within R" just implied "within the R working environment", not strictly using just the R language. – Jack Wasey Nov 24 '15 at 09:53
2

Try the head utility. It should be available on all operating systems that R supports (on Windows it assumes you have Rtools installed and the Rtools bin directory is on your path). For example, to copy the first 100 lines from in.dat to out.dat:

shell("head -n 100 in.dat > out.dat")
G. Grothendieck
-2

try using

line <- read.csv(infile, nrows = 1000)
write(line, file = outfile, append = TRUE)
David Arenburg
user_flow