4

Or a list of how to do in R things you do in SQL (or vise versa) ?

Thanks,

Tal

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
Tal Galili
  • 24,605
  • 44
  • 129
  • 187
  • Like what? What kind of operations are you trying to do? – nico Jul 03 '10 at 13:10
  • Hi Nico, I refer to all operations. I am trying to see if there is something like R for SQL users (like there is the book R for SAS users) - so there would be a nice way to leverage my knowledge in R for when I'll have a stab at SQL/ – Tal Galili Jul 03 '10 at 15:02

5 Answers5

3

sqldf package could be of some help here perhaps?

There is also a talk from Joshua accompanying presentation that Shane mentioned above.

radek
  • 7,240
  • 8
  • 58
  • 83
  • Thanks Radek - that package actually gave me the idea for the question. Interesting how no one made such a thing. Maybe this should be some sort of R community project (that is - to take a bunch of SQL tasks and that all of us will compile all the ways to do them in R - something like this maybe: http://rosettacode.org/wiki/Category:Database_operations). – Tal Galili Jul 03 '10 at 19:41
  • For me that would be a bliss, since I feel much more comfortable managing data with SQL [so far]. Thanks for rosetta link - interesting. – radek Jul 05 '10 at 20:04
2

You could look at Joshua Reich's presentation on R and SQL (see page 11).

Shane
  • 98,550
  • 35
  • 224
  • 217
2

The examples section at the bottom of the help(sqldf) page in the sqldf package has quite a few SQL commands and their R counterparts.

G. Grothendieck
  • 531
  • 2
  • 2
1

It's also worth looking into the RMysQL package.

I work with very large datasets that cannot be dumped into text prior to importing in R. This package allows me to use standard mysql queries from within R to pull in subsets of my data.

Maiasaura
  • 32,226
  • 27
  • 104
  • 108
  • Thank you. I played with it about two years ago. I remember that the connection time was very long. Is this still an issue today ? – Tal Galili Jul 03 '10 at 21:12
  • I find RMySQL operations to be very slow, compared with native queries or wrappers for other languages. – neilfws Jul 04 '10 at 01:54
  • I find it reasonably fast. Although I must emphasize that it is not something you should do repetitively. It's good to have a workflow (http://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing/1434424#1434424) so queries only happen once and get written to .rdata files. In subsequent runs, you read the .rdata file and not run the query repeatedly. When your database changes, then you rerun steps one and two. – Maiasaura Jul 04 '10 at 05:33
  • I haven't used RMySQL, but RODBC and RJDBC and haven't found any speed problems. So maybe you could try those if RMySQL is slow. – Matti Pastell Jul 04 '10 at 10:44
1

I just started working with RMySQL recently and really like the package. I just run basic SQL queries in R itself. Most of the data re-arranging is done in several independent SQL scripts, basically some stored procedures.

I think R is a statistical package with some nice merging capability but it´s not meant to handle relational data that way. I do work a lot with micro data and have to set up non-relational datasets from these micro data (and then use R for regression analysis and plotting ggplot2 (!)) . I also do data aggregation in SQL itself before connecting to R.

I also recommend to use views (if they are fast enough for you). R accesses them like ordinary tables using the list tables statement.

Besides there´s RPostgreSQL out there, if you wanna give postgreSQL a try. I tried it once but switched to RMySQL because RPostgreSQL was so hard to setup on my Mac and after an update the config was gone. RMySQL was much easier. Back then I had to compile the package on my own, so if you run another OS, you might get a binary (or there´s a Mac OS one there by now) .

In any case there is some literature on RPostgreSQL out there that might help you even if you use RMySQL, particulary if you plan to use it for timeseries data (e.g. TSPostgreSQL).

Matt Bannert
  • 27,631
  • 38
  • 141
  • 207