7

I try to load a large datafile (some 20 million rows) using fread() from the data.table package. Some rows are causing great trouble, however.

Minimal example:

text.csv contains:

id, text
1,"""Oops"",\""The"",""Georgia"""        

fread("text.csv", sep=",")

Error in fread("text.csv", sep = ",") : 
  Not positioned correctly after testing format of header row. ch=','
In addition: Warning message:
In fread("text.csv", sep = ",") :
  Starting data input on line 2 and discarding line 1 because it has too few or too many items to be column names or data: id, text

read.table() works somewhat better but is too slow and too memory inefficient.

> read.table("text.csv", header = TRUE, sep=",")
  id                     text
1  1 "Oops",\\"The","Georgia"

I realize that my text file is not properly formatted, but it is too large to edit in a practical manner.

Any help much appreciated.

EDIT:

A small sample of actual data records:

sample1.txt, a good record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music

> fread("sample1.txt", sep=",")
               materiale_id      dk5                      description         creator subject-phrase
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA
                                           title  type
1: Koncert i Copenhagen Jazz House den 26.1.1995 music


sample2.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000000003,[78.793],Privatoptagelse. - Liveoptagelse,Frederik Lundin,,Koncert i Copenhagen Jazz House den 26.1.1995,music
150012-leksikon:100019,,"Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...",,"[""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]",it - elementer i databehandling,article

> fread("sample2.txt", sep=",")
Empty data.table (0 rows) of 11 cols: 150012-leksikon:100019,V2,Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...,V4,[""Informatik","it"...

EDIT 2:

Updating to R version 3.2.3 and data.table 1.9.6. helps on the above, but creates issues with other records:

sample3.txt, a good and a bad record:

materiale_id,dk5,description,creator,subject-phrase,title,type
125030-katalog:000236595,,,Red Tampa Solist prf,"[""Tom"",""Georgia"",""1929-1930""]","Georgia Tom, 1929-1930",music
125030-katalog:000236596,,,Jane Lucas (Solist),"[""1928-1931"",""Tom,\""The"",""Georgia"",""Accompanist""]","Georgia Tom,""The Accompanist"" (1928-1931)",music

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

EDIT 3:

Updating to the development version 1.9.7 of data tables breaks fread() altogether:

> s3 <- fread("sample3.txt", sep=",")
Error in fread("sample3.txt", sep = ",") : 
  showProgress is not type integer but type 'logical'. Please report.

EDIT 4:

It seems that the problem in my file occurs when records contain the string \\" (litteraly, not regular expression). Apparently, there's one backslash too many, causing fread() to misinterpret a double quote as the end of a string, where it should have been taken litteraly.

My best solutions so far is to do this:

m1 <- readLines("data.csv", encoding="UTF-8")
m2 <- gsub("\\\\\"", "\\\"", m1)    
writeLines(m2, "data_new.csv", useBytes = TRUE)
m3 <- fread("data_new.csv", encoding="UTF-8", sep=",")

That seems to work.

I don't understand this 100% though, so any clarifications are more than welcome.

  • If it's only header line, then you can skip 1st line on read in. And then pull it out again via `readLines()` and do proper adjustments. – statespace Feb 25 '16 at 12:08
  • Thank you, but it's data lines that are causing occational problems – Lasse Hjorth Madsen Feb 25 '16 at 12:42
  • 2
    You may want to use something like [this](http://stackoverflow.com/questions/29499145/preventing-column-class-inference-in-fread/29499512#29499512), which removes extra quotes (just the first part). – Rich Scriven Feb 28 '16 at 18:02
  • 1
    Issue similar to this was [recently fixed](https://github.com/Rdatatable/data.table/commit/7e4f3edd0bfd90edaf5c82500c957644ac871a60). Could you please try on `sample3` again? It works as expected for me now. – Arun Mar 17 '16 at 14:01

1 Answers1

2

Not a data.table solution, but you could try:

# read the file with 'readLines'
tmp <- readLines("trl.txt")

# create a column name vector of the first line
nms <- trimws(strsplit(tmp[1],',')[[1]])

# convert 'tmp' to a dataframe except the first line
tmp <- as.data.frame(tmp[-1])

# use 'separate' from 'tidyr' to split into two columns
library(tidyr)
df1 <- separate(tmp, "tmp[-1]", nms, sep=",", extra = "merge")

which gives:

> df1
  id                             text
1  1 """Oops"",\\""The"",""Georgia"""

Update for edit 1: With the new example data fread seems to be reading the data normally:

> s1 <- fread("sample1.txt", sep=",")
> s1
               materiale_id      dk5                      description         creator subject-phrase                                         title  type
1: 125030-katalog:000000003 [78.793] Privatoptagelse. - Liveoptagelse Frederik Lundin             NA Koncert i Copenhagen Jazz House den 26.1.1995 music


> s2 <- fread("sample2.txt", sep=",")
> s2
               materiale_id      dk5
1: 125030-katalog:000000003 [78.793]
2:   150012-leksikon:100019         
                                                                                                                                                                                                           description
1:                                                                                                                                                                                    Privatoptagelse. - Liveoptagelse
2: Databehandling vedrører rutiner og procedurer for datarepræsentation, lagring af data, overførsel af data mellem forskellige instanser eller brugere af data, beregninger eller andre operationer udført med...
           creator                                                                                                                         subject-phrase
1: Frederik Lundin                                                                                                                                       
2:                 [""Informatik"",""it"",""It, teknik og naturvidenskab"",""leksikonartikel"",""Software, programmering, internet og webkommunikation""]
                                           title    type
1: Koncert i Copenhagen Jazz House den 26.1.1995   music
2:               it - elementer i databehandling article

Update for edit 2 & 3:

When you look at the error-message:

Error in fread("sample3.txt", sep = ",") : Expecting 7 cols, but line 3 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=',' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.

and then when you look at the second line of sample3.txt you will see that the fourth column contains comma's as well. You can solve this in three steps:

1: Read the file with readLines and replace the opening and closing character of the fourth column with another quote-character:

r3 <- readLines("sample3.txt")
r3 <- gsub('\"[',"'",r3,fixed=TRUE)
r3 <- gsub(']\"',"'",r3,fixed=TRUE)

2: Write it back to a text-file:

 writeLines(r3, "sample3-1.txt")

3: Now you can read it with fread (or read.table/read.csv). Because the number of column-titles is not the same as the number of columns, you will have to use header = FALSE. Also explicitely set the quote-character to the new quote-character as inserted in step 2:

s3 <- fread("sample3-1.txt", quote = "\'", header = FALSE, skip = 1)

which gives:

> s3
                         V1 V2 V3                   V4                                                        V5           V6                               V7    V8
1: 125030-katalog:000236595 NA NA Red Tampa Solist prf                         ""Tom"",""Georgia"",""1929-1930"" "Georgia Tom                       1929-1930" music
2: 125030-katalog:000236596 NA NA  Jane Lucas (Solist) ""1928-1931"",""Tom,\\""The"",""Georgia"",""Accompanist"" "Georgia Tom ""The Accompanist"" (1928-1931)" music

After that you can assign column names as follows:

names(s3) <- c("character","vector","with","eight","column","names")

NOTE: I used a pretty recent version (two weeks old) of v1.9.7 for this

Jaap
  • 81,064
  • 34
  • 182
  • 193
  • Thank you @Jaap. This is very innovative and helpful. It doesn't seem to quite work in my particular case, as there are text fields containing commas. I'm starting to think that the file is so irregular that there might not be a solution. Thanks anyway, it was an informative suggestion. – Lasse Hjorth Madsen Feb 27 '16 at 23:49
  • 1
    @LasseHjorthMadsen Could you add a larger sample of the text that is causing problems? – Jaap Feb 28 '16 at 07:55
  • I've added a couple of actual data points. In sample2.txt it appears that it is the 5th column containing a text string with both commas and quotes, that is causing problems. Thanks again, @Jaap – Lasse Hjorth Madsen Feb 28 '16 at 17:26
  • 1
    @LasseHjorthMadsen See the update. It looks like `fread` read both `sample1.txt` as well as `sample2.txt` the way it should. What version of *data.table* are you using? – Jaap Feb 28 '16 at 20:43
  • I'm using data.table 1.9.4, R version 3.1.2 for Windows. Could it possibly help to upgrade to R version 3.2.3? – Lasse Hjorth Madsen Feb 28 '16 at 21:30
  • 1
    @LasseHjorthMadsen The latest stable (CRAN) version is *v1.9.6* which has several improvements of `fread`. If that still doesn't work, you could also update to the [development version](https://github.com/Rdatatable/data.table/wiki/Installation) which includes some more improvements to `fread`. – Jaap Feb 28 '16 at 21:44
  • As descriped in EDIT 2 and EDIT 3, updating `data.table` helps some issues, but creates others. I can't get the development version _1.9.7_ to work. – Lasse Hjorth Madsen Feb 28 '16 at 22:53
  • I appreciate your effort so very much, @Jaap. However, creating an 8th column for some records creates problems for others. A sample of the first 200k records is here: [link](http://we.tl/VNDQBYTRN1). By the way I get different errors after doing`sample <- readLines("all data.csv, n=200000)` and then `writeLines(sample, "sample200k.txt")` then `newsample <- fread("sample200k.txt")` so perhaps R somehow changes the file? – Lasse Hjorth Madsen Feb 29 '16 at 09:56
  • If you would like the full, original dataset of some 2e7 rows, [here](http://oss-services.dbc.dk/files/materialedata.csv.gz) it is. It would also confirm that somethings appear to change after `writeLines(readLines("data.csv"), "newdata.csv)`. – Lasse Hjorth Madsen Feb 29 '16 at 20:57
  • @LasseHjorthMadsen I think I found a solution. Will update later today. – Jaap Mar 03 '16 at 07:55
  • I am all ears, @Jaap. In the meantime, Just last night, I've found some sort of workaround myself; will post in a minute – Lasse Hjorth Madsen Mar 03 '16 at 08:00
  • @LasseHjorthMadsen The idea I had, did run into other problems. So, I guess you will need to use the method you included in 'edit 4'. – Jaap Mar 03 '16 at 17:07