149

I have a CSV file (24.1 MB) that I cannot fully read into my R session. When I open the file in a spreadsheet program I can see 112,544 rows. When I read it into R with read.csv I only get 56,952 rows and this warning:

cit <- read.csv("citations.CSV", row.names = NULL, 
                comment.char = "", header = TRUE, 
                stringsAsFactors = FALSE,  
                colClasses= "character", encoding= "utf-8")

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

I can read the whole file into R with readLines:

rl <- readLines(file("citations.CSV", encoding = "utf-8"))
length(rl)
[1] 112545

But I can't get this back into R as a table (via read.csv):

write.table(rl, "rl.txt", quote = FALSE, row.names = FALSE)
rl_in <- read.csv("rl.txt", skip = 1, row.names = NULL)

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

How can I solve or work around this EOF message (which seems to be more of an error than a warning) to get the entire file into my R session?

I have similar problems with other methods of reading CSV files:

require(sqldf)
cit_sql <- read.csv.sql("citations.CSV", sql = "select * from file")
require(data.table)
cit_dt <- fread("citations.CSV")
require(ff)
cit_ff <- read.csv.ffdf(file="citations.CSV")

Here's my sessionInfo()

R version 3.0.1 (2013-05-16)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] tools     tcltk     stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ff_2.2-11             bit_1.1-10            data.table_1.8.8      sqldf_0.4-6.4        
 [5] RSQLite.extfuns_0.0.1 RSQLite_0.11.4        chron_2.3-43          gsubfn_0.6-5         
 [9] proto_0.3-10          DBI_0.2-7   
Ben

9 Answers

227

You need to disable quoting.

cit <- read.csv("citations.CSV", quote = "", 
                 row.names = NULL, 
                 stringsAsFactors = FALSE)

str(cit)
## 'data.frame':    112543 obs. of  13 variables:
##  $ row.names    : chr  "10.2307/675394" "10.2307/30007362" "10.2307/4254931" "10.2307/20537934" ...
##  $ id           : chr  "10.2307/675394\t" "10.2307/30007362\t" "10.2307/4254931\t" "10.2307/20537934\t" ...
##  $ doi          : chr  "Archaeological Inference and Inductive Confirmation\t" "Sound and Sense in Cath Almaine\t" "Oak Galls Preserved by the Eruption of Mount Vesuvius in A.D. 79_ and Their Probable Use\t" "The Arts Four Thousand Years Ago\t" ...
##  $ title        : chr  "Bruce D. Smith\t" "Tomás Ó Cathasaigh\t" "Hiram G. Larew\t" "\t" ...
##  $ author       : chr  "American Anthropologist\t" "Ériu\t" "Economic Botany\t" "The Illustrated Magazine of Art\t" ...
##  $ journaltitle : chr  "79\t" "54\t" "41\t" "1\t" ...
##  $ volume       : chr  "3\t" "\t" "1\t" "3\t" ...
##  $ issue        : chr  "1977-09-01T00:00:00Z\t" "2004-01-01T00:00:00Z\t" "1987-01-01T00:00:00Z\t" "1853-01-01T00:00:00Z\t" ...
##  $ pubdate      : chr  "pp. 598-617\t" "pp. 41-47\t" "pp. 33-40\t" "pp. 171-172\t" ...
##  $ pagerange    : chr  "American Anthropological Association\tWiley\t" "Royal Irish Academy\t" "New York Botanical Garden Press\tSpringer\t" "\t" ...
##  $ publisher    : chr  "fla\t" "fla\t" "fla\t" "fla\t" ...
##  $ type         : logi  NA NA NA NA NA NA ...
##  $ reviewed.work: logi  NA NA NA NA NA NA ...

I think it is because of lines like this one (note the embedded quotes around "Thorn" and "Minus"):

 readLines("citations.CSV")[82]
[1] "10.2307/3642839,10.2307/3642839\t,\"Thorn\" and \"Minus\" in Hieroglyphic Luvian Orthography\t,H. Craig Melchert\t,Anatolian Studies\t,38\t,\t,1988-01-01T00:00:00Z\t,pp. 29-42\t,British Institute at Ankara\t,fla\t,\t,"
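To illustrate why disabling quoting helps, here is a minimal sketch on a made-up four-line file (the file contents and the names `tmp`, `bad` and `good` are assumptions for illustration, not taken from citations.CSV). With the default `quote = "\""`, an unbalanced double quote makes the parser treat everything up to the next quote (or the end of the file) as one field, so rows get merged. With `quote = ""`, the quote is just another character.

```r
# Hypothetical file: the second data row contains an unbalanced double quote.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,title",
             "1,Plain title",
             "2,An unbalanced \" quote",
             "3,Another title"), tmp)

# Default quoting: the stray quote opens a quoted field that never closes,
# which typically produces the "EOF within quoted string" warning.
bad <- suppressWarnings(read.csv(tmp, stringsAsFactors = FALSE))

# Quoting disabled: fields are split on commas only.
good <- read.csv(tmp, quote = "", stringsAsFactors = FALSE)

nrow(bad)    # usually fewer rows than the file really has
nrow(good)   # one row per data line
```

The trade-off is that with `quote = ""` a comma inside a quoted field is no longer protected and will split the field, so this only suits files whose fields never contain the separator.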
dickoa
  • Thanks, that's an easy fix. Now what do you think about getting `fread` working in this situation? I prefer that because it's a lot faster than `read.csv`. But `fread` doesn't seem to take a `quote` argument.. – Ben Jul 01 '13 at 23:13
  • 1
    @Ben I tried to make it work too without success and as you pointed out `fread` doesn't play nice with embedded quote in general, but I'm sure there will be a workaround soon. http://stackoverflow.com/questions/16094025/data-tablefread-and-unbalanced – dickoa Jul 01 '13 at 23:20
  • 1
    I had 7,000 rows when I used `write.csv()` and was getting 403 back with `read.csv()`. Adding quote = "" got me up to 410 rows. `read.table()` does no better. I wonder what else can be tried... – Hack-R Aug 21 '14 at 15:12
  • 3
    Same problem as Hack-R, adding quote = "" increased my rowcount by 30,000 but I'm still missing over 200,000. – SJDS May 05 '15 at 09:51
  • 1
    Could you please write a line as to why you need to add that. (I am a Python programmer trying to learn R). Otherwise the answer is perfect(+1) – Bhargav Rao Jul 02 '15 at 16:40
  • I ran into this problem when I saved a file from SQLite and then read it directly. This is the proper solution. You are the best! – koralgooll Oct 08 '16 at 13:27
  • If you disable quoting, what do you do with data that contain the same character as the separator? They will get split in two. – Luke Feb 09 '22 at 11:23
13

I'm a new-ish R user and thought I'd post this in case it helps anyone else. I was trying to read in data from a comma-separated text file that included a few Spanish characters, and it took me forever to figure it out. I knew I needed to use UTF-8 encoding, set the header argument to TRUE, and set the sep argument to ",", but I still got hang-ups. After reading this post I tried setting the fill argument to TRUE, but then got the same "EOF within quoted string" warning, which I was able to fix in the same manner as above. My successful read.table call looks like this:

target <- read.table("target2.txt", fill=TRUE, header=TRUE, quote="", sep=",", encoding="UTF-8")

The result has Spanish language characters and same dims I had originally, so I'm calling it a success! Thanks all!

mjd876
7

As pointed out above, simply disabling quoting altogether by adding:

    quote = ""

to the read.csv() call worked for me.

The error, "EOF within quoted string", occurred with:

    > iproscan.53A.neg     = read.csv("interproscan.53A.neg.n.csv",
    +                        colClasses=c(pb.id      = "character",
    +                                     genLoc     = "character",
    +                                     icode      = "character",
    +                                     length     = "character",
    +                                     proteinDB  = "character",
    +                                     protein.id = "character",
    +                                     prot.desc  = "character",
    +                                     start      = "character",
    +                                     end        = "character",
    +                                     evalue     = "character",
    +                                     tchar      = "character",
    +                                     date       = "character",
    +                                     ipro.id    = "character",
    +                                     prot.name  = "character",
    +                                     go.cat     = "character",
    +                                     reactome.id= "character"),
    +                                     as.is=T,header=F)
    Warning message:
    In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
      EOF within quoted string
    > dim(iproscan.53A.neg)
    [1] 69383    16

The file as read in was missing 6,619 lines. But after disabling quoting:

    > iproscan.53A.neg     = read.csv("interproscan.53A.neg.n.csv",
    +                        colClasses=c(pb.id      = "character",
    +                                     genLoc     = "character",
    +                                     icode      = "character",
    +                                     length     = "character",
    +                                     proteinDB  = "character",
    +                                     protein.id = "character",
    +                                     prot.desc  = "character",
    +                                     start      = "character",
    +                                     end        = "character",
    +                                     evalue     = "character",
    +                                     tchar      = "character",
    +                                     date       = "character",
    +                                     ipro.id    = "character",
    +                                     prot.name  = "character",
    +                                     go.cat     = "character",
    +                                     reactome.id= "character"),
    +                                     as.is=T,header=F,quote="")
    > 
    > dim(iproscan.53A.neg)
    [1] 76002    16

This worked without error and all lines were successfully read in.

  • 6
    You are repeating an earlier answer and then crippling its utility by the addition of unnecessary flanking double asterisks inside the code block. – IRTFM Aug 10 '17 at 16:01
5

I had a similar problem, but in my case the cause was the presence of apostrophes (i.e. single quotation marks) within some of the text values. This is especially frequent when working with data that include French text, e.g. «L'autre jour».

The solution was simply to change the quote argument so that it excludes the «'» symbol: with quote = "\"" (i.e. double quotation mark only), everything worked fine.

I hope that can help some of you. Cheers.
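A minimal sketch of the apostrophe case, using a made-up file (the contents and the names `tmp` and `fr` are assumptions for illustration). Note that `read.table()` treats both `"` and `'` as quote characters by default, whereas `read.csv()` already defaults to `quote = "\""`, so this fix matters mainly when calling `read.table()` directly.

```r
# Hypothetical file with an apostrophe in one value and a properly
# double-quoted value that contains a comma.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id,phrase",
             "1,L'autre jour",
             "2,Bonjour",
             "3,Au revoir",
             "4,\"Oui, merci\""), tmp)

# Only the double quote acts as a quoting character, so the apostrophe
# is read as ordinary text and quoted commas are still protected.
fr <- read.table(tmp, header = TRUE, sep = ",", quote = "\"",
                 stringsAsFactors = FALSE)

nrow(fr)
fr$phrase
```

Unlike `quote = ""`, this keeps double-quoted fields intact, so values containing the separator are not split.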

marQIsoftGuy
5

The readr package will fix this issue.

install.packages('readr')
library(readr)
readr::read_csv('yourfile.csv')
vladiim
  • This is the answer that solved the problem. I tried all the answers above and none of them worked for me. – Masoud Sep 20 '21 at 05:42
3

I also ran into this problem, and was able to work around a similar EOF error using:

read.table("....csv", sep=",", ...)

Notice that the separator parameter is defined within the more general read.table().

Arman H
Tony T
  • 39
  • 1
  • 2
    Hi, this doesn't work for me... I got the error `Error in read.table(".csv", : more columns than column names`. It seems that skipping (skip = 6) doesn't work correctly... – maycca Oct 24 '15 at 22:57
3

Actually, using read.csv() to read a file with free-text content is not a good idea. Disabling quoting with quote = "" is only a temporary fix: it only helps with stray quotation marks. Other things can cause the warning too, such as certain special characters.

A permanent solution (still using read.csv()) would be to find out what those special characters are and use a regular expression to remove them.

Have you thought about installing the data.table package and using fread() to read the file? It is much faster and, in my experience, does not trigger this EOF warning. Note that the result is stored as a data.table object rather than a plain data.frame. The data.table class has many good features, but you can convert it with as.data.frame() if needed.
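One way to start on the "find the offending characters" idea is to count the double quotes on each line and flag the lines with an odd count. This is only a rough sketch on a made-up file (the contents and the names `tmp`, `n_quotes` and `suspect` are assumptions for illustration), and it would also flag legitimate quoted fields that span several lines.

```r
# Hypothetical file: only the third line has an unbalanced double quote.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,title",
             "1,\"Balanced\" quotes",
             "2,An unbalanced \" quote",
             "3,No quotes at all"), tmp)

lines <- readLines(tmp)

# Number of double-quote characters on each line.
n_quotes <- lengths(regmatches(lines, gregexpr("\"", lines, fixed = TRUE)))

# Lines with an odd number of quotes are the likely culprits.
suspect <- which(n_quotes %% 2 == 1)
lines[suspect]
```

The flagged line numbers can then be inspected by hand, or cleaned with `gsub()` before parsing the text again.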

floatsd
2

I had a similar problem: an EOF warning, and only part of the data was loaded by read.csv(). I tried quote = "", but it only removed the EOF warning.

But looking at the first row that was not loading, I found that there was a special character, an arrow → (hexadecimal value 0x1A) in one of the cells. After deleting the arrow I got the data to load normally.
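Byte 0x1A is the SUB (Ctrl-Z) control character, which text-mode reads on Windows have traditionally treated as an end-of-file marker. That would explain why loading stopped at that row. Instead of deleting it by hand, the byte can be stripped programmatically. The following is a sketch on a made-up file (the contents and the names `tmp`, `out`, `bytes` and `cleaned` are assumptions for illustration).

```r
# Hypothetical file with a stray 0x1A byte inside one cell.
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,note\n1,before"),
           as.raw(0x1A),
           charToRaw("after\n2,clean\n")), tmp)

# Read the file as raw bytes so no text-mode processing interferes.
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))

# Byte offsets of every 0x1A character, useful for locating the bad rows.
which(bytes == as.raw(0x1A))

# Drop the offending bytes and write a cleaned copy.
cleaned <- bytes[bytes != as.raw(0x1A)]
out <- tempfile(fileext = ".csv")
writeBin(cleaned, out)

dat <- read.csv(out, stringsAsFactors = FALSE)
nrow(dat)
```

Writing to a separate file keeps the original untouched in case the byte turns out to be meaningful.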

ElinaJ
0

I faced the same issue when loading a dataset with more than 100,000 rows; read.csv() seemed to struggle beyond a certain number of rows. Instead, you can use the fread() function from the data.table package.