fread() error and strange behaviour when reading csv

Question

I used fread() from data.table library to try read a 540MB csv file. It returned an error message saying:

' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.

I have no idea what caused the error and want to track down if it's a bug or just some data formating issue that I can tweak fread() to process.

I managed to read the csv using read.csv(), and decided to track down the row that triggered the error above (line 617174, not line 4 as the error message above). I then re-output the row and one row each immediately preceding and following the offending row, written out using write.csv() as testout.csv

I was able to read back testout.csv using read.csv() creating a data frame with 3 observations, as expected. Using fread() on testout.csv, however, resulted in a data table with only 1 observation, which is the last row.

The four lines in testout.csv are below (I start a new line for each entry below for readability).

"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"

20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129

20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail. .",617130

20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131

When I ran fread("testout.csv", sep=",", verbose=TRUE), the output was

Input contains no \n. Taking this to be a filename to open
File opened, filesize is  1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)

Any idea what may have caused the unexpected results, and the error in the first place? And any way around it? Just to be clear, my aim is to be able to use fread() to read the main file, even though read.csv() works so far.

What happens if you use `read.table` (invoking more options than available with `read.csv`) from the base package? — Carl Witthoft, Jun 21 '14 at 12:17
`dfrt <- read.table("testout.csv", header=TRUE, sep=",")` gave me the same result as the output from `read.csv`, as expected. — Ricky, Jun 21 '14 at 13:10
Are there line breaks in the file, perhaps inside quoted fields? If so, fread can't cope with that yet. On the line after the 4th one you show, in the penultimate field, which is "" on the 4th line. — Matt Dowle, Jun 21 '14 at 13:32
Or in the "dime to nickel" field. See those 2 periods. Is read.csv converting the newline to dot? I think fread should retain the newline when it deals with it - what you think? — Matt Dowle, Jun 21 '14 at 13:45
Thanks Matt, you're right. I just checked the data again, the 2 periods after "dime to nickel" has a newline in it, which doesn't reflect in the text above. I think `fread` should retain the newline if `sep` is explicitly set as something else. In auto behaviour, I guess the newline will confuse things. But in cases like csv file above, with the delimiter being quite explicit, I think newline should be treated just like any other characters. That seems to be the way `read.csv` handles it, which gives a reassuring consistency. — Ricky, Jun 21 '14 at 14:29
So if `fread` can't cope with newline in quoted fields yet, am I right that the only work around is still to read with other means e.g. `read.csv` into a data frame, and then convert the data frame to data table? — Ricky, Jun 21 '14 at 14:31
Yes I'm afraid so. It's been raised a few times so will try and get to it soon. — Matt Dowle, Jun 21 '14 at 14:41

score 4 · Accepted Answer · edited May 23 '17 at 12:04

4

UPDATE: Now fixed in v1.9.3 on GitHub :

fread() now accepts line breaks inside quoted fields. Thanks to Clayton Stanley for highlighting.
See: fread and a quoted multi-line column value

Windows users are reporting success with the latest version from GitHub.

edited May 23 '17 at 12:04

Community

1
1

answered Jun 25 '14 at 10:19

Matt Dowle

58,872
22
166
224

I am still having the same error as Ricky after installing v1.9.3 from GitHub. However, I am using a Mac platform (x86_64-apple-darwin13.1.0). I must be taking the same course that Ricky took because my assignment is identical. – C8H10N4O2 Sep 19 '14 at 19:16

fread() error and strange behaviour when reading csv

1 Answers1

Linked