5

I have a 5 GB CSV file (also available as a SAS data file, if that would be easier) which I need to put into a SQL database so I can work with it in R.

The variable names are all contained in the first line and are double quoted. There are 1000+ variables, some numeric, others character (though some of the character variables are strings of numerals; I'm not too worried about that, since I can fix it in R).

My question is: how can I import the CSV file into a new table in my database with minimal pain?

I've found advice saying to create the table first (which means specifying all of the variables, of which I have 1000+) and then use `.import file table` to bring in the data, or to use some GUI import wizard, which is not an option for me.

Sorry if this is SQL 101, but thanks for the help.

MaDMaD Mad
  • How much RAM do you have to work with? – Matt Parker Dec 03 '10 at 23:20
  • We just switched to a new research cluster so I'll have to check with the admin on the specifics but I believe with the default job submission options I get 16 GB, and of course I can request more as needed. – MaDMaD Mad Dec 05 '10 at 17:31
  • Just be sure to also check the option of leaving everything in memory; save the workspace from time to time from your program. 16 GB is more than enough. It could be much faster and easier, and you even have the option to request more RAM from your operator if needed. (And you are running Linux, right?) – mrsteve Dec 05 '10 at 21:22
  • I don't think I understand you correctly. I can import the data into a workspace, sure, and save and reload it as needed, but I can't leave the program loaded in memory between sessions. I'm going to be working with this dataset for a while, and reloading the workspace each time will take more time than I'd like to spend waiting. And yes, we are using Red Hat now; it used to be Solaris. – MaDMaD Mad Dec 06 '10 at 20:45

2 Answers

9

Here's my workflow:

library("RSQLite")
setwd("~/your/dir")
db <- dbConnect(SQLite(), dbname="your_db.sqlite") ## will make, if not present
field.types <- list(
        date="INTEGER",
        symbol="TEXT",
        permno="INTEGER",
        shrcd="INTEGER",
        prc="REAL",
        ret="REAL")
dbWriteTable(conn=db, name="your_table", value="your_file.csv", row.names=FALSE, header=TRUE, field.types=field.types)
dbGetQuery(db, "CREATE INDEX IF NOT EXISTS idx_your_table_date_sym ON crsp (date, symbol)")
dbDisconnect(db)

The `field.types` argument isn't necessary; RSQLite will guess the column types from the file if you don't provide this list. The index isn't required either, but it will speed up your queries later on (provided you index the columns your queries actually filter on).
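
Once the table is built, you can pull just the slices you need back into R. Here's a rough sketch of that step; the table, column, and file names are the placeholders from above, and the symbol and date range are made up for illustration:

library("RSQLite")
db <- dbConnect(SQLite(), dbname="your_db.sqlite")
## grab only the rows and columns you need instead of loading all 5 GB
sub <- dbGetQuery(db, "SELECT date, prc, ret
                         FROM your_table
                        WHERE date BETWEEN 20050101 AND 20051231
                          AND symbol = 'IBM'")  ## the (date, symbol) index helps here
dbDisconnect(db)

The result `sub` is an ordinary data frame, so everything downstream is plain R.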

I've been learning a lot of this stuff here on SO, so if you check my questions asked/answered on SQLite, you may find some tangential stuff.

Richard Herron
  • +1. Notice that if your comma-separated file has a different extension (sometimes...), `dbWriteTable` will fail. Just rename it to `.csv`. – Ryogi Oct 28 '11 at 00:23
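
For instance, a quick way to do that rename from within R (the file names here are made up):

file.rename("your_file.dat", "your_file.csv")  ## per the comment above, the file import can fail on other extensions
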
4

Look at the `read.csv.sql` function in the sqldf package.

This imports a CSV file into an SQLite database and then reads it into R; you should be able to keep the intermediate database to use for your purposes.
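
A rough sketch of what that could look like, assuming made-up database, table, and file names (the persistent-database pattern follows the sqldf examples linked in the comment below):

library(sqldf)  ## sqldf drives SQLite via RSQLite under the hood

## create an empty SQLite database on disk
sqldf("attach 'your_db.sqlite' as new")

## load the csv into a table in that database without pulling all 5 GB through R;
## within the sql statement the csv file is referred to as 'file'
read.csv.sql("your_file.csv",
             sql = "create table your_table as select * from file",
             dbname = "your_db.sqlite", header = TRUE)

## later, query just what you need back into R
first_rows <- sqldf("select * from your_table limit 10", dbname = "your_db.sqlite")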

Greg Snow
  • You can find an example of using sqldf to read a csv file into an sqlite database in example 13d on the sqldf home page: http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql – G. Grothendieck Dec 04 '10 at 04:34