
I am trying to load a largish CSV file into a SQLite database using the RSQLite package (I have also tried the sqldf package). The file contains all UK postcodes and a variety of lookup values for them.

I want to avoid loading it into R and instead load it directly into the database. While this is not strictly necessary for this task, I want to have the technique ready for larger files that won't fit in memory, should I have to handle them in the future.

Unfortunately the CSV is provided with the values in double quotes, and the dbWriteTable function doesn't seem able to strip or ignore them. Here is the download location of the file: http://ons.maps.arcgis.com/home/item.html?id=3548d835cff740de83b527429fe23ee0

Here is my code:

# Load library
library("RSQLite")

# Create a temporary directory
tmpdir <- tempdir()

# Set the file name
file <- "data\\ONSPD_MAY_2017_UK.zip"

# Unzip the ONS Postcode Data file
unzip(file, exdir = tmpdir )

# Create a path pointing at the unzipped csv file
ONSPD_path <- paste0(tmpdir,"\\ONSPD_MAY_2017_UK.csv")

# Create a SQLite database connection
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")

# Now load the data into our SQLite database
dbWriteTable(conn = db_connection,
              name = "ONS_PD",
              value = ONSPD_path,
              row.names = FALSE,
              header = TRUE,
              overwrite = TRUE
              )

# Check the data upload
dbListTables(db_connection)
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD  LIMIT 20")

Having hit this issue, I found a reference tutorial (https://www.r-bloggers.com/r-and-sqlite-part-1/) which recommends using the sqldf package, but unfortunately when I try the relevant function in sqldf (read.csv.sql) I get the same issue with the double quotes.

This feels like a fairly common issue when importing CSV files into a SQL system; most import tools can handle double quotes, so I'm surprised to be hitting a problem here (unless I've missed an obvious help file somewhere along the way).

EDIT 1

Here is some example data from my csv file in the form of a dput output of the SQL table:

structure(list(pcd = c("\"AB1 0AA\"", "\"AB1 0AB\"", "\"AB1 0AD\"", 
"\"AB1 0AE\"", "\"AB1 0AF\""), pcd2 = c("\"AB1  0AA\"", "\"AB1  0AB\"", 
"\"AB1  0AD\"", "\"AB1  0AE\"", "\"AB1  0AF\""), pcds = c("\"AB1 0AA\"", 
"\"AB1 0AB\"", "\"AB1 0AD\"", "\"AB1 0AE\"", "\"AB1 0AF\"")), .Names = c("pcd", 
"pcd2", "pcds"), class = "data.frame", row.names = c(NA, -5L))

EDIT 2

Here is my attempt using the filter argument in sqldf's read.csv.sql function (note that Windows users will need Rtools installed for this). Unfortunately this still doesn't seem to remove the quotes from my data, although it does mysteriously remove all the spaces.

library("sqldf")
sqldf("attach 'ons_lkp_db' as new")
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
read.csv.sql(ONSPD_path,
              sql = "CREATE TABLE ONS_PD AS SELECT * FROM file",
              dbname = "ons_lkp_db",
              filter = 'tr.exe -d ^"'
              )

dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD  LIMIT 5")

Also, thanks for the close vote from whoever felt this wasn't a programming question in the scope of Stack Overflow(?!).

Tumbledown
  • See FAQ #13 on the sqldf github page. Also the help file for `read.csv.sql` Please review [mcve]. – G. Grothendieck Jul 17 '17 at 11:25
  • Thanks, I'll try the faq advice once I have installed rtools. Could you point me to where my example needs improving?? – Tumbledown Jul 17 '17 at 11:33
  • For others following: https://github.com/ggrothendieck/sqldf#13-how-does-one-deal-with-quoted-fields-in-readcsvsql – Tumbledown Jul 17 '17 at 11:36
  • Show a small subset of input lines right in the question to make it self contained or if they are long then shorten them sufficiently to keep the main idea without overloading the question with large data. – G. Grothendieck Jul 17 '17 at 11:59
  • I'll add this a little later when I have more time. I've just tried the suggestion from the FAQ and still no luck: Error in utils::read.table(value, sep = sep, header = header, skip = skip, : no lines available in input Error in rsqlite_send_query(conn@ptr, statement) : no such table: file – Tumbledown Jul 17 '17 at 17:06
  • My syntax works fine if I take out the filter argument. I'll update my post later to show precisely what I've tried. – Tumbledown Jul 17 '17 at 17:07
  • Ignore previous comments, my environment variables hadn't updated. Data now loads with the filter command but still has quotes. – Tumbledown Jul 17 '17 at 17:16
  • Just a note that after adding filter = 'tr.exe -d ^" to the read.csv.sql function it still doesn't seem to remove double quotes but does remove spaces. – Tumbledown Jul 24 '17 at 09:07
  • Trying it in Windows 10, it seems you need single quotes around `^"` . Also read [mcve]. – G. Grothendieck Jul 24 '17 at 12:46
  • To provide a reproducible example in this case you will need to show *input* lines from the csv file. – G. Grothendieck Jul 24 '17 at 13:15

3 Answers


The CSV importer in the RSQLite package is derived from the sqlite3 shell, which itself doesn't seem to offer support for quoted values when importing CSV files (How to import load a .sql or .csv file into SQLite?, doc). You could use readr::read_delim_chunked():

# `con` is an open DBI connection to the SQLite database (e.g. from dbConnect(SQLite(), ...))
callback <- function(data, pos) {
  name <- "ONS_PD"
  exists <- dbExistsTable(con, name)
  # create the table on the first chunk, append on later chunks
  dbWriteTable(con, name, data, append = exists)
}

readr::read_delim_chunked(ONSPD_path, callback, ...)

Substitute ... with any extra arguments you need for your CSV file.
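
For the file in the question, a fleshed-out call might look roughly like this (an untested sketch; the delim, chunk_size and col_types values are assumptions rather than part of the original answer):

readr::read_delim_chunked(ONSPD_path, callback,
                          delim = ",",                              # comma-separated input
                          chunk_size = 100000,                      # rows per chunk; tune to available memory
                          col_types = readr::cols(.default = "c"))  # read all columns as character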

krlmlr
  • Thanks for this, do you have any idea on the speed of this for large files? I'm hoping for a solution to the issue I'm having with filtering in the read.csv.sql function but will try your suggestion when I get chance next. – Tumbledown Jul 24 '17 at 09:10
  • @Tumbledown: readr is pretty fast; while `fread()` from data.table is reported to be even faster, it seems to have fewer options for parsing "complicated" CSV files, and I haven't found an option to add a callback. `dbWriteTable()` uses prepared statements and is quite fast too. I don't know how this compares to sqldf. – krlmlr Jul 24 '17 at 09:44

Use read.csv.sql from the sqldf package with the filter argument and provide any utility which strips out double quotes or which translates them to spaces.

The question does not provide a fully reproducible minimal example but I have provided one below. If you are using read.csv.sql in order to pick out a subset of rows or columns then just add the appropriate sql argument to do so.

First set up the test input data and then try any of the one-line solutions shown below. Assuming Windows, ensure that the tr utility (found in R's Rtools distribution) or the third party csvfix utility (found here and for Linux also see this) or the trquote2space.vbs vbscript utility (see Note at end) is on your path:

library(sqldf)
cat('a,b\n"1","2"\n', file = "tmp.csv")

# 1 - corrected from FAQ
read.csv.sql("tmp.csv", filter = "tr.exe -d '^\"'")

# 2 - similar but does not require Windows cmd quoting
read.csv.sql("tmp.csv", filter = "tr -d \\42")

# 3 - using csvfix utility (which must be installed first)
read.csv.sql("tmp.csv", filter = "csvfix echo -smq")

# 4 - using trquote2space.vbs utility as per Note at end
read.csv.sql("tmp.csv", filter = "cscript /nologo trquote2space.vbs")

any of which give:

  a b
1 1 2

You could also use any other language or utility that is appropriate. For example, your Powershell suggestion could be used although I suspect that dedicated utilities such as tr and csvfix would run faster.

The first solution above is corrected from the FAQ. (It did work at the time the FAQ was written many years back, but testing it now on Windows 10 it seems to require the change indicated; possibly the markdown did not survive intact in the move from Google Code, where it was originally hosted, to GitHub, which uses a slightly different markdown flavor.)

For Linux, tr is available natively although quoting differs from Windows and can even depend on the shell. csvfix is available on Linux too but would have to be installed. The csvfix example shown above would work identically on Windows and Linux. vbscript is obviously specific to Windows.
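
For illustration, a Linux counterpart of the first solution might look like this (an untested sketch; the exact quoting needed can vary by shell):

# Linux/macOS variant of solution 1: delete double quotes with tr under a POSIX shell
read.csv.sql("tmp.csv", filter = "tr -d '\"'")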

Note: sqldf comes with a mini-tr utility written in vbscript. If you change the relevant lines to:

Dim sSearch  : sSearch  = chr(34)
Dim sReplace : sReplace = " "

and change the name to trquote2space.vbs then you will have a Windows-specific utility to change double quotes to spaces.

G. Grothendieck

Honestly, I could not find anything to solve this problem. The sqldf documentation says: "so, one limitation with .csv files is that quotes are not regarded as special within files so a comma within a data field such as "Smith, James" would be regarded as a field delimiter and the quotes would be entered as part of the data which probably is not what is intended".

So, it looks like there is no solution as far as I know.

One possible suboptimal approach (other than the obvious find-and-replace in a text editor) is to use SQL commands like this:

dbSendQuery(db_connection,"UPDATE ONS_PD SET pcd = REPLACE(pcd, '\"', '')")
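
If every column needs cleaning, the same idea can be extended over the whole table. A rough sketch (assuming the connection and table from the question, and using DBI's dbListFields() and dbExecute()):

# Strip the double quotes from every column of ONS_PD after loading
cols <- dbListFields(db_connection, "ONS_PD")
for (col in cols) {
  dbExecute(db_connection,
            sprintf("UPDATE ONS_PD SET %s = REPLACE(%s, '\"', '')", col, col))
}
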
Federico Manigrasso
  • I'm just trying the Rtools suggestion now; failing that, there might be a way you can strip them by calling a Windows PowerShell script on the csv prior to loading. All a bit fiddly though. – Tumbledown Jul 17 '17 at 17:01
  • That quote from the documentation only applies if you don't use the `filter` command. If you use the `filter` command you should be able to transform the input in an arbitrary fashion so if it is possible you should be able to do it If you want more help you will need to provide a self contained reproducible example. We don't know what you have or what you did. – G. Grothendieck Jul 17 '17 at 23:52