
I want to stream a big data table into R LINE BY LINE, and if the current line satisfies a specific condition (let's say the first column is > 15), add the line to a data frame in memory. I have written the following code:

count<-1;
Mydata<-NULL;
fin <- FALSE;
while (!fin){
    if (count==1){
        Myrow=read.delim(pipe('cat /dev/stdin'), header=F,sep="\t",nrows=1);
        Mydata<-rbind(Mydata,Myrow);
        count<-count+1;
    }
    else {
        count<-count+1;
        Myrow=read.delim(pipe('cat /dev/stdin'), header=F,sep="\t",nrows=1);
        if (Myrow!=""){
        if (MyCONDITION){
            Mydata<-rbind(Mydata,Myrow);
        }
        }
        else
        {fin<-TRUE}
    }
}
print(Mydata);

But I get the error "data not available". Please note that my data is big and I don't want to read it all in at once and then apply my condition (in that case it would be easy).

user1250144
  • You may be interested in the answers and comments on this q: http://stackoverflow.com/questions/9352887/strategies-for-reading-in-csv-files-in-pieces – Ari B. Friedman Mar 26 '12 at 11:48
  • see `?scan`, `?readLines`, `nrows` argument of `read.table`, and be aware that your solution will be **very** slow in R -- can you use Perl, or even awk, to pre-process? – Ben Bolker Mar 26 '12 at 12:11
  • 1
    How would my answer below fare in terms of speed? In essence I open a file and keep extracting lines from it without closing the file. – Paul Hiemstra Mar 26 '12 at 12:24
  • Please note that I want read data line by line. My problem is how to tell R that data is streaming in and lines should be received one by one. This is also very easy in Perl, but I was looking for a way to do it in R. – user1250144 Mar 26 '12 at 12:26

2 Answers


I think it would be wiser to use an R function like readLines, which supports reading a fixed number of lines at a time, e.g. one. Combine that with opening a file connection first and then calling readLines repeatedly, and you get what you want: each successive call reads the next n lines from the connection. In R code:

stop <- FALSE
f <- file("/tmp/test.txt", "r")     # open the connection once, before the loop
while (!stop) {
  next_line <- readLines(f, n = 1)  # read exactly one line per call
  ## Insert some if statement logic here
  if (length(next_line) == 0) {     # zero-length result means EOF
    stop <- TRUE
    close(f)
  }
}
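For the OP's actual condition (keep rows whose first column is > 15), the loop above can be fleshed out like this. The sample file, its path, and the tab delimiter are assumptions for illustration:

```r
# Sketch: stream a delimited file line by line, keeping rows whose
# first column is > 15. A tiny sample file is written first so the
# example is self-contained.
writeLines(c("20\tfoo", "10\tbar", "30\tbaz"), "/tmp/test.txt")

f <- file("/tmp/test.txt", "r")
kept <- list()
repeat {
  line <- readLines(f, n = 1)
  if (length(line) == 0) break                 # EOF reached
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  if (as.numeric(fields[1]) > 15) {
    # collect matching rows in a list; bind once at the end
    kept[[length(kept) + 1]] <- as.data.frame(t(fields),
                                              stringsAsFactors = FALSE)
  }
}
close(f)
Mydata <- do.call(rbind, kept)                 # 2 rows: "20" and "30"
```

Collecting rows in a list and calling rbind once avoids the repeated copying that per-line rbind causes.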

Additional comments:

  • R has a built-in way of treating stdin as a file: stdin(). I suggest you use this instead of pipe('cat /dev/stdin'). It is probably more robust, and definitely more cross-platform.
  • You initialize Mydata at the beginning and keep growing it using rbind. As the number of lines you rbind gets larger, this becomes really slow. Each time the object grows, R has to allocate a new block of memory and copy the object into it, which ends up taking a lot of time. Better to pre-allocate Mydata, or use apply-style loops.
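A minimal sketch of the pre-allocation point above; the row count (1000) and the toy columns are arbitrary assumptions:

```r
# Growing with Mydata <- rbind(Mydata, Myrow) copies the whole object
# on every iteration, giving quadratic cost overall. Pre-allocating a
# list of rows and binding once keeps the loop linear.
n <- 1000
rows <- vector("list", n)          # pre-allocated list of rows
for (i in seq_len(n)) {
  rows[[i]] <- data.frame(x = i, y = i^2)
}
Mydata <- do.call(rbind, rows)     # one rbind at the end
```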
Paul Hiemstra
  • Thanks for the answer, but I have a question about it: as I mentioned, my data is very BIG and I don't want to read it into memory. In line 3 of your code, it seems that you are reading the whole data and then going through its lines. Am I right? – user1250144 Mar 26 '12 at 12:29
  • No, I open a connection and then read from it. `next_line` contains only the current line. Using `file` only opens a connection, it does not read anything yet. – Paul Hiemstra Mar 26 '12 at 12:32
  • Ahan. thanks. what should I write instead of "/tmp/test.txt", the first argument of file() ? – user1250144 Mar 26 '12 at 12:38
  • I used the code you mentioned with file("stdin", "r"); however, I cannot read more than 1 line when I stream into this in Linux using cat ToyData.txt | R --vanilla --slave -f MyCode.R. Does anybody know why? – user1250144 Mar 26 '12 at 12:50
  • For standard in use `stdin()`, not `"stdin"`. – Paul Hiemstra Mar 26 '12 at 12:50
  • 1
    And I think you do not need the piping, just use `file("ToyData.txt", "r")`. – Paul Hiemstra Mar 26 '12 at 12:51
  • To read from stdin without losing lots of lines, you may need to explicitly open it, as noted in [another question](http://stackoverflow.com/questions/9370609/piping-stdin-to-r). – Vincent Zoonekynd Mar 26 '12 at 13:18
  • But what would be the advantage of using bash to pipe the file into R and catching it with stdin, instead of just opening a file connection to the file? – Paul Hiemstra Mar 26 '12 at 13:27
  • Just a small correction: in the code above, opening the file should be moved to *before* the loop. – user1250144 Mar 26 '12 at 13:42
  • Reading from stdin gives you a lot of free shell functionality (which I guess separates CS from Data Science folks). – Sridhar Sarnobat Nov 02 '21 at 00:09

You can read stdin line by line using readLines like so:

#!/usr/bin/env Rscript
input <- file("stdin", "r")
while (length(l <- readLines(input, n=1)) > 0) {
    l <- l[[1]]  # isolate the first line
    cat(l)       # prints the line
    cat("\n")
}

The above script simply replicates all stdin lines to stdout, like so:

$ cat in.txt
first
second
third

$ Rscript script.r <in.txt 
first
second
third

Above, in the readLines function call, n=1 is needed to tell readLines to read at most one line, since we want to process everything line by line. Note that readLines always returns a character vector whose length is the number of lines read. Since we pass n=1, we therefore always get either a vector of length one, when there are still lines to be read, or of length zero, when we reach EOF (end-of-file) or hit an error. This is why the while condition checks that the length is greater than zero: the loop finishes on EOF.

The original question also mentions checking a condition before saving the data in memory. Here's an example with a condition (print only lines whose length is 5):

input <- file("stdin", "r")
while (length(l <- readLines(input, n=1)) > 0) {
    l <- l[[1]]
    if (nchar(l)==5) {
        cat(l)
        cat("\n")
    }
}

Here's the output for the new script:

$ Rscript five.r <in.txt 
first
third
Rudy Matela