I'm having trouble seeking around gzfiles in R. Here's an example:
set.seed(123)
m=data.frame(z=runif(10000),x=rnorm(10000))
write.csv(m,"m.csv")
system("gzip m.csv")
file.info("m.csv.gz")$size
[1] 195975
That creates m.csv.gz, which R says it can seek on, and the help for seek seems to agree:
gzf=gzfile("m.csv.gz")
open(gzf,"rb")
isSeekable(gzf)
[1] TRUE
Small jumps, back and forth, seem to work, but if I try a big jump, I get a warning about an internal error:
seek(gzf,10)
[1] 10
seek(gzf,20)
[1] 10
seek(gzf,10)
[1] 20
seek(gzf,1000)
[1] 100
Warning message:
In seek.connection(gzf, 1000) :
seek on a gzfile connection returned an internal error
However, if I reset the connection and start again, I can get to 1000 by seeking in 100-byte steps:
for(i in seq(100,1000,by=100)){seek(gzf,i)}
seek(gzf,NA)
[1] 1000
R has some harsh words on using seek in Windows: "Use of ‘seek’ on Windows is discouraged." But this is on a Linux box (R 3.1.1, 32-bit). Similar code in Python using the gzip module works fine, seeking all over.
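For comparison, here is a minimal sketch of the kind of Python check I mean. It is not the exact script I ran: it builds its own gzipped CSV in a temporary directory (so the values and file size differ from the R example above), and the names path, raw, and positions are just for illustration. It seeks forwards and backwards by arbitrary amounts and compares what it reads against the uncompressed bytes.

```python
import gzip
import os
import random
import tempfile

# Build a gzipped CSV roughly shaped like m.csv.gz.
# Assumption: the contents are generated here, not copied from the R session.
random.seed(123)
path = os.path.join(tempfile.mkdtemp(), "m.csv.gz")
rows = ['"","z","x"\n'] + [
    "%d,%r,%r\n" % (i, random.random(), random.gauss(0, 1))
    for i in range(1, 10001)
]
raw = "".join(rows).encode("ascii")
with gzip.open(path, "wb") as f:
    f.write(raw)

# Seek to offsets in the uncompressed stream, both small and large jumps,
# forwards and backwards, and check the bytes found at each offset.
positions = []
matches = []
with gzip.open(path, "rb") as f:
    for target in (10, 20, 10, 1000, 5, 100000, 1000):
        positions.append(f.seek(target))  # seek returns the new position
        matches.append(f.read(8) == raw[target:target + 8])

print(positions)
print(all(matches))
```

Note that in Python the offsets refer to the uncompressed data, and a backward seek on a gzip stream is done by rewinding and decompressing again from the start, so it can be slow on large files even though it works.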
R 3.2.0 is slightly more informative:
Warning messages:
1: In seek.connection(gzf, 1000) : invalid or incomplete compressed data
2: In seek.connection(gzf, 1000) :
seek on a gzfile connection returned an internal error
Ideas? I've submitted this as a bug report now.