3

I have to work with a .csv file that comes like this:

"IDEA ID,""IDEA TITLE"",""VOTE VALUE"""
"56144,""Net Present Value PLUS (NPV+)"",1"
"56144,""Net Present Value PLUS (NPV+)"",1"

If I use read.csv, I obtain a data frame with one variable. What I need is a data frame with three columns, where columns are separated by commas. How can I handle the quotes at the beginning of the line and the end of the line?

Blisskarthik
  • 1,246
  • 8
  • 20
user3819143
  • 31
  • 1
  • 2
  • 1
    Please see `quote` option of `read.csv`. –  Jul 09 '14 at 06:56
  • This is probably the answer you need: http://stackoverflow.com/a/17415028/1036500 – Ben Jul 09 '14 at 07:12
  • possible duplicate of [read.csv warning 'EOF within quoted string' prevents complete reading of file](http://stackoverflow.com/questions/17414776/read-csv-warning-eof-within-quoted-string-prevents-complete-reading-of-file) – Nikolay Nenov Jul 09 '14 at 09:49
  • I don't think these linked questions will actually answer the OP's question ... – Ben Bolker Jul 09 '14 at 15:56
  • Yikes. How did someone manage to screw up a csv file so badly? Do you have any idea why they formatted it this way? – MrFlick Jul 09 '14 at 16:00

2 Answers2

1

I don't think there's going to be an easy way to do this without stripping the initial and terminal quotation marks first. If you have sed on your system (Unix [Linux/MacOS] or Windows+Cygwin?) then

read.csv(pipe("sed -e 's/^\"//' -e 's/\"$//' qtest.csv"))

should work. Otherwise

read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))

is a little less efficient for big files (you have to read in the whole thing before processing it), but should work anywhere.

(There may be a way to do the regular expression for sed in the same, more-compact form using parentheses that the second example uses, but I got tired of trying to sort out where all the backslashes belonged.)

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
  • Would you also recommend changing the back-to-back double quote in the middle of the string to a single double quote? – MrFlick Jul 09 '14 at 16:02
  • not sure what you mean here. The middle of which string? The regexp in the second example is meant to signify "beginning of line followed by double-quote OR double-quote followed by end of line" ... – Ben Bolker Jul 09 '14 at 16:19
  • Right. You are dealing with the ones at the end of the lines (if I'm reading that right). There is also `""IDEA TITLE""`. Won't those quotes mess things up? – MrFlick Jul 09 '14 at 16:26
  • @Fo.: I didn't think so, but those answers do work (with a bit of adaptation) here. However, I still think my `sed`-based answer might be faster, where `sed` is available ... – Ben Bolker Jul 17 '14 at 23:01
  • @Fo., are you repeating yourself? – Ben Bolker Jul 17 '14 at 23:23
0

I suggest both removing the initial/terminal quotes and turning the back-to-back double quotes into single double quotes. The latter is crucial in case some of the strings contain commas themselves, as in

"1,""A mostly harmless string"",11"
"2,""Another mostly harmless string"",12"
"3,""These, commas, cause, trouble"",13"

Removing only the initial/terminal quotes while keeping the back-to-back quote leads the read.csv() function to produce 6 variables, as it interprets all commas in the last row as value separators. So the complete code might look like this:

data.text <- readLines("fullofquotes.csv")  # Reads data from file into a character vector.
data.text <- gsub("^\"|\"$", "", data.text) # Removes initial/terminal quotes.
data.text <- gsub("\"\"", "\"", data.text)  # Replaces "" by ".
data <- read.csv(text=data.text, header=FALSE)

Or, of course, all in a single line

data <- read.csv(text=gsub("\"\"", "\"", gsub("^\"|\"$", "", readLines("fullofquotes.csv", header=FALSE))))