I have a file containing a list of numbers (to create it yourself: for x in $(seq 10000); do echo $x; done > file).

$> R -q -e "x <- read.csv('file', header=F); summary(x);"

> x <- read.csv('file', header=F); summary(x);
       V1       
 Min.   :    1  
 1st Qu.: 2501  
 Median : 5000  
 Mean   : 5000  
 3rd Qu.: 7500  
 Max.   :10000  

Now, one might expect that cat-ing the file and reading from /dev/stdin would produce the same output, but it does not:

$> cat file | R -q -e "x <- read.csv('/dev/stdin', header=F); summary(x);"
> x <- read.csv('/dev/stdin', header=F); summary(x);
       V1       
 Min.   :    1  
 1st Qu.: 3281  
 Median : 5520  
 Mean   : 5520  
 3rd Qu.: 7760  
 Max.   :10000 

Using table(x) shows that a bunch of lines were skipped:

    1  1042  1043  1044  1045  1046  1047  1048  1049  1050  1051  1052  1053 
    1     1     1     1     1     1     1     1     1     1     1     1     1 
 1054  1055  1056  1057  1058  1059  1060  1061  1062  1063  1064  1065  1066 
    1     1     1     1     1     1     1     1     1     1     1     1     1
 ...

It looks like R is doing something funny with stdin, since this Python one-liner correctly prints every line in the file:

cat file | python -c 'with open("/dev/stdin") as f: print(f.read())'

This question seems related, but it is more about skipping lines in a malformed CSV file, whereas my input is just a list of numbers.

Travis Gockel
  • The problem disappears if you use `stdin` instead of `/dev/stdin`. – Vincent Zoonekynd Jun 21 '12 at 04:40
  • @VincentZoonekynd that's interesting indeed! Any idea why, and which OS were you using? (linux, unix, darwin, cygwin) – Carl Witthoft Jun 21 '12 at 11:12
  • @CarlWitthoft: I do not know why it happens. I first thought that the input was buffered and that only the first buffer was read, but if I increase the size of the file, more data is read. Since `stdin` is recognized by R, it may be interfering in some way with `/dev/stdin` and silently stealing some of the lines. I am on Linux. – Vincent Zoonekynd Jun 21 '12 at 12:19
  • I'm on Linux as well. And using plain old `stdin` is the fix (I'm new to R and didn't know it supported that). – Travis Gockel Jun 21 '12 at 15:51

1 Answer

head --bytes=4K file | tail -n 3

yields this:

1039
1040
104

This suggests that R creates a 4KB input buffer on /dev/stdin and fills it during initialisation. When your R code then reads /dev/stdin, it picks up in file just past that point:

   1
1042
1043
...

Indeed, if you replace the line 1041 in file with 1043, table(x) shows a "3" instead of a "1":

3  1042  1043  1044  1045  1046  1047  1048  1049  1050  1051  1052  1053 
1     1     1     1     1     1     1     1     1     1     1     1     1 
...

The first 1 in table(x) is actually the last digit of 1041. The first 4KB of file have been eaten.
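
As a rough check of this explanation, here is a minimal Python sketch that rebuilds the same input, drops the first 4096 bytes, and parses whatever is left. It is not R's actual code: the 4096-byte buffer size is the assumption inferred above, and `surviving_values` is a name made up for this illustration.

```python
import statistics

BUFFER_SIZE = 4096  # assumption: R consumes one 4 KiB buffer from stdin at startup


def surviving_values(n_lines=10000, consumed=BUFFER_SIZE):
    """Return the integers left in the input after `consumed` bytes are dropped."""
    # Same content as `seq 10000`: one number per line, newline-terminated.
    data = "".join(f"{i}\n" for i in range(1, n_lines + 1)).encode("ascii")
    # Everything before the cut is what the buffer would have swallowed.
    remainder = data[consumed:].decode("ascii")
    return [int(line) for line in remainder.split("\n") if line]


values = surviving_values()
print(values[:3])                        # first values seen by the reader
print(len(values))                       # number of values left
print(statistics.median(values))         # compare with R's Median
print(round(statistics.mean(values)))    # compare with R's Mean
```

The cut lands after the "104" of 1041, so the surviving values start with 1, 1042, 1043. There are 8960 of them, with a median of 5520.5 and a mean of about 5520, which is consistent with the summary shown in the question.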

jrouquie