0

R version: 2.12.2 / OS: Windows XP

Data that I need to query and manipulate is stored in a SPSS .sav file (size ~ 112MB) and I have a very basic knowledge of SPSS syntax. I am much more comfortable with R (and generally prefer it for data manipulation and analysis), so I saved the SPSS file as a tab-delimited file (size ~ 85MB) and have attempted to query it using functionality from the RODBC package.

Using the 'Data Sources (ODBC)' application within the Administrative Tools group on Windows Control Panel, I setup a database (Windows folder) which has tables (tab-delimited files) with the Microsoft Text Driver.

The current table of interest has a 'Date' field (called 'DateTime') where dates are formatted '%Y/%m/%d'. When I first tried to query the database with conditions on this field, I was returned a result set with zero observations. So, I changed 'DateTime' to datatype 'Character' and tried my query again (to no avail).

I am able to successfully query the database, i.e.

sqlQuery(channel = cn, query = "SELECT * FROM large_table.txt", as.is = TRUE, stringsAsFactors = FALSE, max = 20)

returns results. The following queries return result sets with zero observations (I have verified that there is at least one record with the the value '2011/04/01' in its 'DateTime' field).

sqlQuery(channel = cn, query = "SELECT DateTime FROM large_table.txt WHERE DateTime = {d '2011-04-01'}", as.is = TRUE, stringsAsFactors = FALSE)

sqlQuery(channel = cn, query = "SELECT DateTime FROM large_table.txt WHERE DateTime = '2011/04/01'", as.is = TRUE, stringsAsFactors = FALSE)

Can anyone tell me why the result set has no observations?

Any suggestions to improve my database connection or query are greatly appreciated. I prefer to use the RODBC package instead of another database-query package, but am willing to try something new if necessary.

Jubbles
  • 4,450
  • 8
  • 35
  • 47
  • Any reason why you do not read either the SPSS file or the tab delimited file into a data frame and query it using either data frame syntax or `sqldf`? – ROLO Oct 19 '11 at 17:50
  • See comment below. The end result is to create a dataframe that is the result of merging three or more data files (or subsets of them) currently stored in SPSS format. I eventually run into memory issues before I can create the 'final' data frame. – Jubbles Oct 20 '11 at 15:57
  • If you have a database available or are willing to install one the SPSS function "Export to database" could also be a possibility. You can then just query the generated database tables in R afterwards. The memisc package (@paulb20's answer) sounds quite interesting too. – ROLO Oct 21 '11 at 07:29
  • 1
    I just stumbled upon [this](http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r/1820610#1820610) answer here on SO. Maybe that helps. – ROLO Oct 21 '11 at 07:39

2 Answers2

3

I'd use the 'memisc' package for this. set up spss.system.file() importers for your 3 SPSS files, subset using the memisc version to load in to R just the variables that are needed, including handling user-missing levels.

There are some oddities around how SPSS files are set up - it is fairly frequent for SPSS files with numeric variables to have user-missings set, which may need sorting at the R end into an interval scaling with appropriate NAs.

paulb20
  • 191
  • 3
  • Thank you very much for the suggestion of 'memisc'. About two years ago, I briefly used this package, but had forgotten about it since. It really does the trick with SPSS files - it allows you to peek at the columns (and their data types and labels) and read a subset in as a data frame. – Jubbles Oct 21 '11 at 22:06
2

You can save yourself a lot of hassle by reading the SPSS file directly into R.

library(foreign)
dat <- read.spss("filename.sav", to.data.frame=TRUE)

This will return a data.frame that you can analyse using all of the normal R tools.

As a side benefit, you will also have the questionnaire text in the attributes of your data.frame.

See ?read.spss for more details.

Andrie
  • 176,377
  • 47
  • 447
  • 496
  • One caveat before I give a response to your suggestion - I am trying to merge data from three or more different data sets (all currently stored in SPSS format). I have tried this before and have run into R memory issues. In my experience, it's not a viable option. – Jubbles Oct 20 '11 at 14:46
  • If possible, I'd love to establish a connection to the SPSS file, and query it through the connection. – Jubbles Oct 20 '11 at 15:41