
Packages like RMySQL and sqldf allow one to interface with local or remote database servers. I'm creating a portable project which involves importing SQL data in cases (or on devices) that do not always have access to a running server, but which do always have access to the latest .sql dump of the database.

The goal seems simple enough: import an .sql dump into R without the involvement of a MySQL server. More specifically, I'd like to create a list of lists in which the elements correspond to any databases defined in the .sql dump (there may be multiple), and those elements in turn consist of the tables in those databases.
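To illustrate the shape I'm after, here is a hypothetical example (the names and values are made up):

#target structure: one element per database, each holding its tables as data frames
dumps <- list(
  sportsdb = list(
    addresses = data.frame(id=1:2, city=c("Boston", "Denver")),
    teams     = data.frame(id=1:2, name=c("Red Sox", "Rockies"))
  )
)
dumps$sportsdb$addresses  #access one table as an ordinary data frame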

To make this reproducible, let's take the sample sportsdb SQL file here — if you unzip it, it's called sportsdb_sample_mysql_20080303.sql.

One would think sqldf might be able to do it:

read.csv.sql('sportsdb_sample_mysql_20080303.sql', sql="SELECT * FROM addresses")
Error in sqliteSendQuery(con, statement, bind.data) :
  error in statement: no such table: addresses

This is despite the fact that there certainly is a table addresses in the dump. This post on the sqldf list mentions the same error, but offers no solution.

Then there is an sql.reader function in the ProjectTemplate package, which looks promising. Poking around, I found the source for the function here; it assumes a running database server and relies on RMySQL — not what I need.

So... we seem to be running out of options. Any help from the hivemind appreciated!

(To reiterate, I am not looking for a solution that relies on access to an SQL server; that's easy with dbReadTable from the RMySQL package. I would very much like to bypass the server and get the data straight from the .sql dump file.)
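For contrast, the server-based route I want to avoid looks roughly like this (the connection details below are placeholders):

library(RMySQL)

#assumes a running MySQL server into which the dump has already been imported;
#dbname, host, user and password are placeholders
con <- dbConnect(MySQL(), dbname="sportsdb", host="localhost",
                 user="user", password="password")
addresses <- dbReadTable(con, "addresses")
dbDisconnect(con)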

strangeloop
  • was your .sql dump created by mysqldump? – chinsoon12 Apr 20 '16 at 08:53
  • Not sure — created by a backup script on a web server running MySQL. I've had no trouble importing the dumps in MySQL Workbench, but getting a server to run locally can be a pain in the ass if all you want is a quick import of the latest data. (And no, I haven't been able to connect to the server from R reliably; too many layers of security imposed by our sysadmins.) – strangeloop Apr 20 '16 at 18:58
  • That's a non-trivial problem. A .sql dump file contains SQL statements. If you don't have control over what created the .sql file, you don't know whether it contains CREATE TABLE statements, or just INSERT statements. And string-parsing an INSERT can be problematic: you can't just stop at the first semicolon that occurs, since that could be part of a value. Same thing with commas and parens. For some very special cases, you could probably get something working, but as a general utility I really don't think it's feasible. – spencer7593 Apr 20 '16 at 23:42
  • Upvoted @spencer7593's comment. Take note of his "same thing with commas" point; my answer splits on a single space, so be careful and double-check the results when using it. – chinsoon12 Apr 22 '16 at 23:04

3 Answers


Depending on what you want to extract from the tables, here is how you can play around with the data:

#count the lines in the dump to get a sense of its size
numLines <- R.utils::countLines("sportsdb_sample_mysql_20080303.sql")
# [1] 81266

#peek at the first 60 lines
linesInDB <- readLines("sportsdb_sample_mysql_20080303.sql", n=60)

Then you can use some regex to get the table names (after CREATE TABLE), the column names (between the first pair of parentheses) and the VALUES (the lines after CREATE TABLE, between the second pair of parentheses); a rough sketch follows.
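A minimal sketch of that idea, assuming each CREATE TABLE and INSERT INTO statement sits on a single line and that the INSERT statements list their column names (the addresses table is only used as an example):

lines <- readLines("sportsdb_sample_mysql_20080303.sql")

#table names: the identifier following CREATE TABLE (possibly backquoted)
createLines <- grep("^CREATE TABLE", lines, value=TRUE)
tableNames  <- gsub("^CREATE TABLE `?([^` (]+)`?.*$", "\\1", createLines)

#column names of one table: the text between the first pair of parentheses
#of its INSERT INTO `table` (col1, col2, ...) VALUES (...) statement
insertLine <- grep("^INSERT INTO `?addresses`?", lines, value=TRUE)[1]
colPart    <- sub("^[^(]*\\(([^)]*)\\).*$", "\\1", insertLine)
colNames   <- trimws(gsub("`", "", strsplit(colPart, ",")[[1]]))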

Reference: Reverse engineering a mysqldump output with MySQL Workbench gives "statement starting from pointed line contains non UTF8 characters" error


EDIT: in response to the OP's answer, if I interpret the Python script correctly, it also reads the file line by line, filters for INSERT INTO lines, parses them as CSV, and writes them to file. This is very similar to my original suggestion; my version in R is below. If the file is too large, it would be better to read it in chunks using some other R package.

options(stringsAsFactors=F)
library(utils)
library(stringi)
library(plyr)

mysqldumpfile <- "sportsdb_sample_mysql_20080303.sql"

#keep only the INSERT INTO lines of the dump
allLines <- readLines(mysqldumpfile)
insertLines <- allLines[which(stri_detect_fixed(allLines, "INSERT INTO"))]

#split each INSERT line into its words (table name, column names, values);
#simplify=NA pads shorter rows with NA so the result can become a data.frame
allwords <- data.frame(stri_extract_all_words(insertLines, simplify=NA))
d_ply(allwords, .(X3), function(x) {
    #x <- split(allwords, allwords$X3)[["baseball_offensive_stats"]]
    print(x[1,3])

    #find where the header/data columns start and end
    valuesCol <- which(x[1,]=="VALUES")
    lastCols <- which(apply(x, 2, function(y) all(is.na(y))))
    datLastCol <- head(c(lastCols, ncol(x)+1), 1) - 1

    #format and prepare for write to file
    df <- data.frame(x[,(valuesCol+1):datLastCol])
    df <- setNames(df, x[1,4:(valuesCol-1)])
    #type convert before writing to file otherwise its all strings
    df[] <- apply(df, 2, type.convert)
    #write to file
    write.csv(df, paste0(x[1,3],".csv"), row.names=F)
})
chinsoon12
  • This actually does not take into account lines where there are multiple comma-separated entries, for example: INSERT INTO actor VALUES (1,'PENELOPE','GUINESS','2006-02-15 04:34:33'), (2,'NICK','WAHLBERG','2006-02-15 04:34:33'), (3,'ED','CHASE','2006-02-15 04:34:33') – Vaibhav Singh Feb 08 '19 at 05:56
  • Is there any way to solve the above by modifying the script? (one possible approach is sketched below) – Vaibhav Singh Feb 08 '19 at 05:56
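A minimal sketch of one way to handle such multi-row INSERT statements, assuming the quoted values themselves contain no parentheses:

library(stringi)

line <- "INSERT INTO actor VALUES (1,'PENELOPE','GUINESS'),(2,'NICK','WAHLBERG'),(3,'ED','CHASE');"
#pull out each parenthesised tuple; this breaks if a value itself contains ( or )
tuples <- stri_extract_all_regex(line, "\\([^()]*\\)")[[1]]
#strip the surrounding parentheses, leaving one CSV-like row per tuple
rows <- stri_sub(tuples, 2, -2)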

I don't think you will find a way to import an SQL dump (which contains multiple tables with references) and then perform arbitrary SQL queries on them within R. This would basically require the R package to run a complete database server (compatible with the one that created the dump) within R.

I would suggest exporting the tables/select statements you need as CSV from your database (see here). If you can only work from the dump and don't want to set up a server for the conversion, you could use some simple regular expressions to turn the INSERT statements in your dump into a bunch of CSV files for the tables, using a tool of your choice like sed or awk (or even R, as suggested by the other answer, though that might be rather slow for this file size).

bnord
  • To be clear, I don't want to "perform arbitrary SQL queries on them within R" — all I'd want is to import all of the data as a list of data frames, then forget about SQL queries and just work on the data in R (when importing with RMySQL from a running local server, it becomes a "large list" of ~50 Mb, which is manageable for my purposes). I guess parsing the `insert` statements to make them more CSV-like is what it'll come down to then... – strangeloop Apr 20 '16 at 19:04

I'll reluctantly answer my own question, using the input from +bnord and +chinsoon12 (who both contributed pieces of the puzzle).

Short answer: there is no out-of-the-box solution. As +bnord notes, the preferred fix would be server-side (e.g., exporting to CSV format with mysqldump). However, as my question indicated, I'm looking for a solution that allows me to work with the SQL dump, bypassing the server.

So if we have to work with the dump, how? The hardcore, manual way is to use regular expressions to convert INSERT statements to CSV, either (1) outside R using sed and awk on the .sql text file (+bnord), or (2) inside R with grep and gsub on strings loaded with readLines (+chinsoon12); a rough sketch of the latter is below.
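A minimal sketch of option (2), assuming each INSERT statement sits on one line, inserts a single row, and the values contain no embedded commas, quotes or parentheses that would confuse the naive regexes (the addresses table is just an example):

lines <- readLines("sportsdb_sample_mysql_20080303.sql")
ins   <- grep("^INSERT INTO `?addresses`?", lines, value=TRUE)

csvText <- gsub("^.*VALUES\\s*\\(", "", ins)  #drop everything up to the first value
csvText <- gsub("\\);\\s*$", "", csvText)     #drop the closing ");"
csvText <- gsub("'", "\"", csvText)           #single quotes to CSV-style double quotes

addresses <- read.csv(text=csvText, header=FALSE)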

Some good soul wrote a Python script that can convert SQL dumps to CSV. This requires yet another piece of (potentially non-trivial to install/maintain) software, so it's not the answer I was hoping for, but it does look like a good model in case anyone wants to reinvent the wheel in R.

For now I'll stick with my modus operandi of (on Windows) running MySQL Community Server and using MySQL Workbench to import the dump, then talking to the local server from R. A very indirect method that is a pain in the ass because of the inscrutable access-rights system of MySQL (especially annoying since it's all just there in an ASCII text file), but the only way for now, it seems. Thanks all for your input!

(If a better, more complete answer comes along I'll gladly accept that, turning this into a comment if possible.)

strangeloop