
Took me a while to figure this out. So, I am answering my own question.

You have a .csv file, you want to load it fast, and you want to use the sqldf package. Your usual code is tripped up by a few annoying fields, in this case quoted values that contain commas. Example:

1001,     Amy,9:43:00, 99.2
1002,"Ben,Jr",9:43:00, 99.2
1003,"Ben,Sr",9:44:00, 99.3

This code only works on *nix systems.

library(sqldf)
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")

If you try to read it with

x <- read.csv.sql("temp.csv", header=FALSE)

R complains

Error in try({ : 
  RS-DBI driver: (RS_sqlite_import: ./temp.csv line 2 expected 4 columns of data but found 5)
  

The solution from sqldf FAQ 13 doesn't work either:

x <- read.csv.sql("temp.csv", filter = "tr -d '\"' ", header=FALSE)

Again, R complains

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  line 1 did not have 5 elements

In fact, the filter only removes the double quotes; the embedded commas remain, so some lines still have five comma-separated fields instead of four.
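
You can see this by running the filter by hand on a *nix system; lines 2 and 3 of the output have five fields, line 1 has four:

system("tr -d '\"' < temp.csv")

1001, Amy,9:43:00, 99.2
1002,Ben,Jr,9:43:00, 99.2
1003,Ben,Sr,9:44:00, 99.3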

So, how to proceed?


2 Answers

Perl and regexes to the rescue. Digging through SO and toying with regexes, it is not too hard to come up with the right one:

 s/(\"[^\",]+),([^\"]+\")/$1_$2/g

which matches "...,...", where the dots are anything but double quotes and commas, and substitutes the comma with an underscore. A perl one-liner is the right filter to pass to sqldf:

x <- read.csv.sql("temp.csv", 
        filter = "perl -e 's/(\"[^\",]+)_([^\"]+\")/$1_$2/g'", 
        header=FALSE)

Here is the dataframe x

> x
    V1       V2      V3   V4
1 1001      Amy 9:43:00 99.2
2 1002 "Ben_Jr" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3

Now, do your own cosmetic clean-up on the strings ...
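
If you want to check what the pattern does without going through sqldf, you can try an equivalent substitution in plain R with gsub (just a check, not part of the filter):

> line <- '1002,"Ben,Jr",9:43:00, 99.2'
> gsub('("[^",]+),([^"]+")', '\\1_\\2', line)
[1] "1002,\"Ben_Jr\",9:43:00, 99.2"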

EDIT: The regex above only replaces the first occurrence of a comma in the field. To replace all the occurrences, use this:

s{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg

What's different?

  1. I replaced the delimiters / with {};
  2. The option e at the very end instructs perl to interpret the replacement field as perl code;
  3. The replacement is itself a simple regex substitution that replaces all "," with "_" within the matched substring $&.

An example:

system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr,More,Commas\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")

The file temp.csv looks like:

1001,                 Amy,9:43:00, 99.2
1002,"Ben,Jr,More,Commas",9:43:00, 99.2
1003,            "Ben,Sr",9:44:00, 99.3

And can be read with

x <- read.csv.sql("temp.csv", 
       filter = "perl -p -e 's{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg'", 
       header=FALSE)
> x
    V1                   V2      V3   V4
1 1001                  Amy 9:43:00 99.2
2 1002 "Ben_Jr_More_Commas" 9:43:00 99.2
3 1003             "Ben_Sr" 9:44:00 99.3
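
If the lines are already in R as character strings and you just need the same clean-up there, a base-R equivalent of the replace-all version can be sketched with gregexpr/regmatches (again, only a check, not part of the sqldf filter):

line <- '1002,"Ben,Jr,More,Commas",9:43:00, 99.2'
m <- gregexpr('"[^"]*"', line)                       # locate the quoted fields
regmatches(line, m) <- lapply(regmatches(line, m),   # replace commas inside them
                              function(s) gsub(",", "_", s))
line
# [1] "1002,\"Ben_Jr_More_Commas\",9:43:00, 99.2"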
  • On `adding filter` argument in read.csv.sql function it gives me error `Error in file(file, "rt") : cannot open the connection In addition: Warning message:In file(file, "rt") : cannot open file 'C:\Users\user\AppData\Local\Temp\Rtmpcbqrz6\filee4819344504': No such file or directory.` I am writing the cmd in RStudio `x <- read.csv.sql("D:/d/myfile.csv", sql="Select * from file where agent='agnt1455' ",filter = "perl -e 's/(\"[^\",]+)_([^\"]+\")/$1_$2/g'") ` – Akki Oct 22 '17 at 13:51

For Windows, sqldf now comes with trcomma2dot.vbs, which does this by default with read.csv2.sql. I found it slow for very large data (>1 million rows), though.

It mentions "tr" for non-Windows systems, but I could not try it.
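
The non-Windows route would presumably be the same kind of character translation done with tr: in a read.csv2-style file the separator is ";", so turning decimal commas into dots is safe. A rough sketch, assuming a semicolon-separated file named myfile.csv (the file name and the exact filter are my illustration, not the package default):

# hypothetical file; "tr , ." turns decimal commas into dots before SQLite sees the data
x <- read.csv2.sql("myfile.csv", filter = "tr , .")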
