
I'm trying to pull a CSV file into R, but for some reason the data was output with quotation marks around each row, so if I read it with the read.csv defaults, each row is treated as a single field. Using the option quote = "", at least the columns are read correctly, but I'm still stuck with the quotation marks in the first and last cells. What's the fastest way to handle this in R without pre-formatting each file?

A sample:

"Year,Month,Day,Hour,Minute,Department,Division,Case_Type,Hundred_Block,Street_Name,Local_Area"
"2009,2,16,15,45,ENG - Streets,ZZ OLD - Streets Design (Road Ahead),Wheelchair Curb/Ramp Request,18##,SPYGLASS PLACE,Fairview"
"2009,2,20,9,6,ENG - Transportation,Neighbourhood Parking and Transportation,Residential Parking Requests,10##,E 20TH AV,Kensington-Cedar Cottage"
"2009,2,27,16,37,ENG - Streets,ZZ OLD - Streets Design (Road Ahead),Wheelchair Curb/Ramp Request,27##,BURRARD ST,Fairview"

EDIT

Thanks MrFlick, maybe you can also mark this question as having higher-quality answers?

Fo.
  • it's not very authoritative when the answer is not accepted and has zero votes – Fo. Jul 13 '14 at 15:39
  • The interpreter is the real authority. It either works or it doesn't ... – Ryogi Jul 13 '14 at 15:42
  • There are lots of things that work, I'm looking for the best answer. That's the purpose of the Stack Exchange format and what sets it apart from forums. – Fo. Jul 13 '14 at 15:43
  • I do think my answer to the linked question would work, and might be fast. Not going to bother copying it from there, but if someone else wants to they're welcome to. The linked question is harder because there are *embedded* quotation marks as well as the ones at the beginning and end of each line. – Ben Bolker Jul 13 '14 at 15:47

2 Answers


1) Just read it twice. The first read will strip off the quotes and the second will parse it into fields:

DF0 <- read.table("myfile.dat", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])
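For example, on the sample above written to a temporary file (a sketch; `comment.char = ""` is added so the `##` values can't be taken as comments on the first pass):

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"Year,Month,Day,Hour,Minute,Department,Division,Case_Type,Hundred_Block,Street_Name,Local_Area"',
  '"2009,2,16,15,45,ENG - Streets,ZZ OLD - Streets Design (Road Ahead),Wheelchair Curb/Ramp Request,18##,SPYGLASS PLACE,Fairview"'
), tmp)

# Pass 1: each fully quoted line is read as one string, with the quotes stripped
DF0 <- read.table(tmp, as.is = TRUE, comment.char = "")
# Pass 2: parse those now-unquoted lines as CSV
DF <- read.csv(text = DF0[[1]])
ncol(DF)    # 11
DF$Year     # 2009
```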

2) If speed is a consideration, try a double fread from data.table:

library(data.table)

DT0 <- fread("myfile.dat", header = FALSE)
DT <- fread(paste(DT0[[1]], collapse = "\n"))
G. Grothendieck
  • this is certainly fast/easy in terms of programming time. Stripping quotation marks with `sed`, if available, might be faster computationally. – Ben Bolker Jul 13 '14 at 15:44
  • 1
    `sed` is not a pure R solution. On Windows one would have to install `sed` or Rtools (which contains `sed`). – G. Grothendieck Jul 13 '14 at 15:58
  • I know. It's a tradeoff between speed and hassle. My other solution was `read.csv(text=gsub("(^\"|\"$)","",readLines("qtest.csv")))`, which might be *marginally* faster than your solution, but maybe not worth fussing about the difference. – Ben Bolker Jul 13 '14 at 16:18
  • The reason I haven't commented is because I have two other problems with this data set that are preventing me from successfully using any of the suggestions so far: 1) an invalid multibyte string at the start of the file, and 2) what R perceives as embedded nulls because of the double hashes (##) in the data as posted above. Here's the error from your second suggestion, which I'm sure would otherwise work fine: `embedded nul in string: '\xff\xfe"\0Y\0e\0a\0r\0,...` – Fo. Jul 13 '14 at 16:41
  • That result is surprising; I would have suggested `comment.char=""`, but that's for `read.csv`, not `fread`. – Ben Bolker Jul 13 '14 at 16:44
  • ## should not cause a problem in either solution. `read.csv` uses `comment.char = ''` by default and `fread` does not pay attention to comment characters at all. – G. Grothendieck Jul 13 '14 at 17:05
  • Also, it looks like you have an encoding problem such as trying to read in a UTF-16 file without using UTF-16 encoding. You will need to play around with the `read.table` argument `fileEncoding`. – G. Grothendieck Jul 13 '14 at 17:14
  • That's it! Finally. I've been playing around with fileEncoding all morning, but not UTF-16. Your first suggestion is the most straightforward and easy to read, though in my particular case, due to encoding, its first step had to be `read.table('data.csv', fileEncoding="UTF-16", comment.char="", as.is=T)` – Fo. Jul 13 '14 at 17:23

Assuming df is the data frame into which you loaded your data, you can try this:

# df <- read.csv(...)

df[, 1]        <- sapply(strsplit(df[, 1], '"'), `[`, 2)
df[, ncol(df)] <- sapply(strsplit(df[, ncol(df)], '"'), `[`, 1)

An alternative is

df[, 1]        <- sapply(df[, 1], function(x) substr(x, 2, nchar(x)))
df[, ncol(df)] <- sapply(df[, ncol(df)], function(x) substr(x, 1, nchar(x) - 1))
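A self-contained sketch of this cleanup on the sample data (note that `strsplit` must be applied to the whole column, and the trailing quote is stripped from the last column, `df[, ncol(df)]`, not the first):

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  '"Year,Month,Day,Hour,Minute,Department,Division,Case_Type,Hundred_Block,Street_Name,Local_Area"',
  '"2009,2,16,15,45,ENG - Streets,ZZ OLD - Streets Design (Road Ahead),Wheelchair Curb/Ramp Request,18##,SPYGLASS PLACE,Fairview"'
), tmp)

# quote = "" splits the columns but leaves a quote in the first and last cells
df <- read.csv(tmp, quote = "", as.is = TRUE)

# drop the leading quote from column 1 and the trailing quote from the last column
df[, 1]        <- sapply(strsplit(df[, 1], '"'), `[`, 2)
df[, ncol(df)] <- sapply(strsplit(df[, ncol(df)], '"'), `[`, 1)

df[1, 1]             # "2009"
df[1, ncol(df)]      # "Fairview"
```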
Ryogi