2

When using write.table or write.csv in R, double quotes are added around all non-numeric fields by default regardless of whether the quotes are actually required for proper parsing of the csv file.

Take a Python script as an example:

import csv
f_out=open("pytest.csv", "w")
wri = csv.writer(f_out, delimiter=',')
wri.writerow(['c_numeric', 'c_str', 'c_str_spec'])
wri.writerow([11, "r1c2", "r1c3 nothing special"])
wri.writerow([21, "r2c2", "r2c3,with delim"])
wri.writerow([31, "r3c2", "r3c3\nwith carriage return"])
wri.writerow([41, "r4c2", "r3c3\"with double quote"])
f_out.close()

This will output the following into pytest.csv:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,"r2c3,with delim"
31,r3c2,"r3c3
with carriage return"
41,r4c2,"r3c3""with double quote"

This is what I expect and follows what Excel will output also.

Now let's process this file using R and write with and without quotes:

df <- read.csv("pytest.csv")
write.csv(df, 'Rtest.csv', row.names=FALSE)
write.csv(df, 'Rtest_NQ.csv', row.names=FALSE, quote=FALSE)

Here is Rtest.csv:

"c_numeric","c_str","c_str_spec"
11,"r1c2","r1c3 nothing special"
21,"r2c2","r2c3,with delim"
31,"r3c2","r3c3
with carriage return"
41,"r4c2","r3c3""with double quote"

Notice the quotes around all non-numeric fields.

Here is Rtest_NQ.csv:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,r2c3,with delim
31,r3c2,r3c3
with carriage return
41,r4c2,r3c3"with double quote

This file is technically corrupt as it's unreadable by any csv reader — so not a good option.

My question: Is there any rfc4180 compatible writer in R that writes like that of Excel or python csv library and most other rfc4180 compatible tools?

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Timothy C. Quinn
  • 3,739
  • 1
  • 35
  • 47
  • 2
    Is there a particular problem with the superfluous quotes? For what it's worth, `pandas.read_csv()` seems to handle this OK by inferring the type column-wise ... – Ben Bolker Feb 11 '20 at 19:58
  • there doesn't seem to be anything against rf4180 here: `Each field **may or may not** be enclosed in double quotes" (emphasis added). – Ben Bolker Feb 11 '20 at 20:06
  • 1
    I use flat files a ton in my daily workflows and have found a need to filter out records based on complex queries. I could not find anything worth while so I wrote such a utility in R and I want the output to follow the most common practices of such tools by avoiding superfluous quotes. FYI - my n00b R script is here: https://github.com/JavaScriptDude/rfilter – Timothy C. Quinn Feb 11 '20 at 20:11

1 Answers1

2

You can write a simple function to construct a csv by converting the data frame to a character matrix, escaping any double quotes then quoting any strings containing commas or line breaks. You then add the column names and write as csv with writeLines

You can even choose a custom delimiter, which will work as long as it is not esoteric enough to be misinterpreted as a regex.

write_unquoted <- function(df, path, delim = ",")
{
  regexp <- paste0(delim, "|\n")
  x <- as.matrix(df) 
  x[grep("\"", x)] <- paste0("\"", gsub("\"", "\"\"", x[grep("\"", x)]), "\"")
  x[grep(regexp, x)]  <- paste0("\"", x[grep(regexp, x)], "\"")
  x <- c(paste0(colnames(x), collapse = delim), apply(x, 1, paste0, collapse = delim))
  writeLines(x, path)
}

So if we start with your example:

df
#>   c_numeric c_str                 c_str_spec
#> 1        11  r1c2       r1c3 nothing special
#> 2        21  r2c2            r2c3,with delim
#> 3        31  r3c2 r3c3\nwith carriage return
#> 4        41  r4c2     r3c3"with double quote

and we do

write_unquoted(df, "my.csv")

We can see it faithfully stores the data frame:

identical(read.csv("my.csv"),  df)
#> [1] TRUE

and if we look at the produced csv, it looks like this:

c_numeric,c_str,c_str_spec
11,r1c2,r1c3 nothing special
21,r2c2,"r2c3,with delim"
31,r3c2,"r3c3
with carriage return"
41,r4c2,"r3c3""with double quote"

that is, only quoted when needed.

I don't know if there are any counterexamples where this simple method isn't RFC4180 compatible.

Allan Cameron
  • 147,086
  • 7
  • 49
  • 87
  • Thanks Allan. Any chance you could update to allow delim as a parameter of write_unquoted() so it will be compatible other delimiters like tabs? – Timothy C. Quinn Feb 11 '20 at 22:06
  • 1
    Awesome! I'll let this soak for a few days before marking as answered as I'm not technically strong enough to visually assess and don't have time or R knowledge to build a test suite to verify. Hopefully someone else is willing to kick the tires on testing this thoroughly for any possible failure modes ;^] – Timothy C. Quinn Feb 11 '20 at 22:22
  • 1
    I was searching for exhaustive write.table testing under https://github.com/wch/r-source but I could not find anything there that was doing any really hard testing of write.table. I heard that Golang has great testing and confirmed that they have a pretty broad test suite for csv: https://github.com/golang/go/blob/master/src/encoding/csv/writer_test.go . – Timothy C. Quinn Feb 12 '20 at 01:56
  • Discovered a potential issue with using other delims. For instance the pipe will not work as-is and would require an escape backslash. I suspect that tab delim would not work as well. Not sure if glob2rx() would work in translating to correct regex. – Timothy C. Quinn Feb 14 '20 at 01:57