2

I am trying to read http://www.cboe.com/publish/ScheduledTask/MktData/datahouse/pcratioarchive.csv (more about the data here http://www.cboe.com/data/PutCallRatio.aspx) using

library(data.table)
download.file(url="http://www.cboe.com/publish/ScheduledTask/MktData/datahouse/pcratioarchive.csv", destfile="pcratioarchive.csv")
outDT <- fread("pcratioarchive.csv", header=FALSE, skip=4)

Somehow this detects strange error (which I can not see in pcratioarchive.csv file itself):

outDT <- fread("pcratioarchive.csv", header=FALSE, skip=4) Error in fread("pcratioarchive.csv", header = FALSE, skip = 4) : Expected sep (',') but new line, EOF (or other non printing character) ends field 2 on line 6 when detecting types: 12/2/1999,0.52

Is there a way to get this work with data.table without manually changing the pcratioarchive.csv ?

My session info:

sessionInfo() R version 3.1.1 (2014-07-10) Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                 
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8       
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8          
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8     
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.3 rj_1.1.3-1      

loaded via a namespace (and not attached):
[1] plyr_1.8.1    Rcpp_0.11.1   reshape2_1.4  rj.gd_1.1.3-1 stringr_0.6.2
[6] tools_3.1.1  
Samo
  • 2,065
  • 20
  • 41

1 Answers1

3

The pcratioarchive.csv file is malformed. For example:

...
10/12/1995,0.63,,,
10/13/1995,0.76,,,
10/16/1995,0.87
10/17/1995,0.76
...
10/17/2003,0.64,,
10/20/2003,0.62,,
10/21/2003,0.7,1.27,0.59
10/22/2003,0.98,1.89,0.77
...

I'm not familiar enough with fread to know if it has arguments to handle this, but read.csv does.

x <- read.csv("pcratioarchive.csv", header=FALSE, skip=4, fill=TRUE)
Joshua Ulrich
  • 173,410
  • 32
  • 338
  • 418
  • 1
    `fread` doesn't have `fill` yet but it exists as [FR #536](https://github.com/Rdatatable/data.table/issues/536). If someone sensible like CBOE is writing files like that, maybe it should be higher priority then. – Matt Dowle Aug 17 '14 at 14:12
  • 1
    @MattDowle: let's not fool ourselves. Most CSV files are likely created by Excel, which can make even the most sensible institutions look foolish. Unless you're very careful, Excel can do all sorts of weird things to your CSV file. – Joshua Ulrich Aug 17 '14 at 15:33
  • 1
    Ahah - good spot. Indeed the charts on the CBOE page linked in question look distinctly like Excel. The Excel 2013 row limit seems to be 1,048,576 rows so I was thinking `fread(,fill=T)` could just divert to `read.csv(,fill=T)` since 1e6 is so small. However, there can be 16,384 columns too. A full Excel sheet could be 64GB if each cell is written as say 4 bytes. Maybe justifies `fread` supporting `fill=T` then. – Matt Dowle Aug 17 '14 at 17:53