I am following up on my question here, "sqldf returns zero observations", with a reproducible example.

I found that the problem probably comes from the comma in one of the cells ("1,500+"), and I think I have to use a filter as suggested here, "sqldf, csv, and fields containing commas", but I am not sure how to define my filter. Below is the code:

library(sqldf)

df <- data.frame("a" = c("8600000US01770", "8600000US01937"),
                 "b" = c("1,500+", "-"),
                 "c" = c("***", "**"),
                 "d" = c("(x)", "(x)"),
                 "e" = c("(x)", "(x)"),
                 "f" = c(992, "-"))

write.csv(df, 'df_to_read.csv')

# 'df_to_read.csv' looks like this:
# "","a","b","c","d","e","f"
# 1,8600000US01770,1,500+,***,(x),(x),992
# 2,8600000US01937,-,**,(x),(x),-

Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing', file.format = list(eol="\n"))

When I run this code, I get the following error:

Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: df_to_read.csv line 2 expected 7 columns of data but found 8
– Hessam

2 Answers


The problem comes from reading the column created by df$b. The first value in that column contains a comma, so the sqldf() function treats it as a field separator. One way to deal with this is to either remove the comma or use some other symbol (like a space). You can also use the read.csv2.sql() function:

library(sqldf)

df <- data.frame("a" = c("8600000US01770" , "8600000US01937"),
                 "b"= c("1,500+" , "-"),
                 "c"= c("***" , "**"),
                 "d"= c("(x)" , "(x)"),
                 "e"= c("(x)" , "(x)"),
                 "f"= c("992" , "-"))

write.csv(df, 'df_to_read.csv', row.names = FALSE)


Housing_filtered <- read.csv2.sql("df_to_read.csv", sql = "select * from file", header=TRUE)
– Katia
  • `read.csv2.sql` takes forever to open my large file. Regarding using other separators or removing commas, should I do that manually in my CSV file? I have a large file and it is not easy to modify it manually. The df in my question was a small example of what I have. – Hessam Jun 17 '18 at 15:21

The best way would be to clean your file once, so that you don't need to worry about the same issue again later in your analysis. This should get you going:

Housing <- readLines("df_to_read.csv")                            # read the file

n <- 6             # number of separators expected = number of columns expected - 1

library(stringr)
ln_idx <- ifelse(str_count(Housing, pattern = ",") == n, 0, 1)
which(ln_idx == 1)               # line indices with issue, includes the header row
#[1] 2

Check for the specific issues and write back to your file at the same indices, e.g. for line 2:

Housing[2]
#[1] "1,8600000US01770,1,500+,***,(x),(x),992"            # hmm.. extra comma

Housing[2] = "1,8600000US01770,1500+,***,(x),(x),992"     # removed the extra comma
writeLines(Housing, "df_to_read.csv")
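
If many lines are flagged (the comment thread below mentions about 300), fixing them one by one does not scale. Here is a minimal programmatic sketch of the same repair, assuming, as the comments suggest, that every extra comma falls inside column `b` (the third field once the row-name column is counted); the `fix_line` helper is my own illustration, not part of the original answer:

fix_line <- function(line, n_cols = 7) {
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  extra <- length(parts) - n_cols
  if (extra > 0) {
    # glue the split-up pieces of column "b" back together,
    # e.g. c("1", "500+") becomes "1500+"
    b <- paste(parts[3:(3 + extra)], collapse = "")
    parts <- c(parts[1:2], b, parts[(4 + extra):length(parts)])
  }
  paste(parts, collapse = ",")
}

Housing <- vapply(Housing, fix_line, character(1), USE.NAMES = FALSE)
writeLines(Housing, "df_to_read.csv")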

Now it's business as usual, good to go:

Housing <- file("df_to_read.csv")
Housing_filtered <- sqldf('SELECT * FROM Housing') 

# Housing_filtered 
#               a      b   c   d   e   f
# 1 8600000US01770  1500+ *** (x) (x) 992
# 2 8600000US01937      -  ** (x) (x)   -
– Mankind_008
    With this approach the following error occurs: Error in scan(file, what = "", sep = sep, quote = quote, nlines = 1, quiet = TRUE, : invalid 'sep' value: must be one byte – Katia Jun 17 '18 at 04:18
  • @Mankind_008: it does work well on the df in my question. However, the file I was trying to imitate in my question is a large CSV file with no `"` and only `,`. Using `sep = '"'` does not work. Sorry, my question was not clear on that. – Hessam Jun 17 '18 at 15:30
  • You should create an example that matches your concern exactly, so that others can help you. No worries. The best way for you is to clean the file once; I have updated the answer to get you started. – Mankind_008 Jun 17 '18 at 19:47
  • @Mankind_008: Thanks. I have a large file with about 300 rows where an irregular `,` exists in a cell (for example: `1,500+`). – Hessam Jun 17 '18 at 22:05
  • Do the commas belong to the same column for all 300 rows, or is the position irregular? And how many extra commas are there in each row? – Mankind_008 Jun 17 '18 at 22:09
  • @Mankind_008: All the commas are in one specific column. There are one or two commas (for example: `1,500+` and `1,500,000+`). What I did was open the CSV file in Excel and delete that column, and it worked. However, it was time consuming, as the file is large and I have multiple files like that. I think my question is already answered by user Ryogi here https://stackoverflow.com/questions/8086559/sqldf-csv-and-fields-containing-commas?noredirect=1&lq=1 but I have no idea how to define the filter. – Hessam Jun 18 '18 at 00:35
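
For what it's worth, a filter in the spirit of that linked answer could be sketched with sqldf's read.csv.sql(): its filter argument pipes the file through a shell command before SQLite parses it. The perl one-liner below is my own guess at a suitable filter, not Ryogi's, and it assumes a Unix-like shell with perl on the PATH. It deletes any comma sitting between a digit and a three-digit group that is not followed by another digit, so `1,500+` and `1,500,000+` are repaired while the separator commas (including the one after a purely numeric row-name column) are left alone:

library(sqldf)

# Strip commas embedded in numbers such as "1,500+" before import.
# The lookarounds match only a comma between a digit and exactly three
# digits not followed by a further digit, which spares the real field
# separators in this file. Assumes perl is available on the PATH.
Housing_filtered <- read.csv.sql(
  "df_to_read.csv",
  sql    = "select * from file",
  header = TRUE,
  filter = "perl -pe 's/(?<=[0-9]),(?=[0-9]{3}([^0-9]|$))//g'"
)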