4

I have been unable to find documentation on how RSQLite handles factors. From a quick test (see below), it looks like they are converted to character.

Question 1: Is there any way to preserve them as factors? I can think of some kludgy ways (mostly involving a separate table or .Rdata file storing the factor levels), but it seems like there should be a standard and therefore more maintainable way of doing this.

Question 2: If not RSQLite, than some other database or database-like package? My use case here is simple: append in a bunch of large (2-5mm row X 550 column) data.frames as each is processed to build a giant database, then be able to select only the rows I want from that database to bring into a data.table and work on.

library(RSQLite)
# Create
db <- dbConnect( SQLite(), dbname="~/temp/test.sqlite" )
# Write test
set.seed(1)
testDat <- data.frame(x=runif(1000),y=runif(1000),g1=sample(letters[1:10],1000,replace=TRUE),g2=rep(letters[1:10],each=100),g3=factor( sample(letters[1:10],1000,replace=TRUE) ))
if(dbExistsTable(db,"test")) dbRemoveTable(db,"test")
dbWriteTable( conn = db, name = "test", value = testDat, row.names=FALSE )
# Read test
testRecovery <- dbGetQuery(db, "SELECT * FROM test")
testSelection <- dbGetQuery(db, "SELECT * FROM test WHERE g3=='h' OR g3=='e' ")
# Close
dbDisconnect(db)
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
  • Side note: I think you konw that [character is often preferred to factor in data table fo key](http://stackoverflow.com/questions/18304760/why-character-is-often-preferred-to-factor-in-data-table-for-key). And what is the dbType equivalent for factors? integer? I mean that you should store the column type information to preserve factor when you read the table. – agstudy Nov 04 '13 at 11:37
  • 2
    To the best of my knowledge, there is currently no way to do this in RSQLite or any DBI based package. The challenge is how to store the metadata about factors. Also, the downsides of storing them as characters are not great. – hadley Nov 04 '13 at 13:03
  • @hadley My concern is that the reference category might change if I append a data.frame on that has an extra (and alphabetically antecedent) level in the character string. At least with factors that would return an error. – Ari B. Friedman Nov 04 '13 at 14:00
  • @AriB.Friedman if your models are sensitive to the reference level, then something is wrong. – hadley Nov 04 '13 at 18:15
  • @hadley Agreed. I should have been more clear. Suppose I run the regression on one machine where the reference level is Black and save the model. Then later I add some data which causes the new reference level to switch. The prediction function expects there to be colorRed, colorBlue, etc. in the model matrix. But now there's a colorBlack because I added "Aubergine" to the database which became the first level when it was converted back to a factor. I don't care about Aubergine being there because I'm subsetting to avoid those years of data. This is admittedly paranoid, but.... – Ari B. Friedman Nov 04 '13 at 18:24

2 Answers2

6

It looks pretty simple to me: factor is a concept only S and R know. Full stop.

So to get them into a DB and back, you need to write mappers. Either be simplistic and do everything as.character (and assume most DB backend will hash strings just as R does). Or be DB-centric and split the factor into just the (unsigned) (and possibly short) integers, and the labels.

Dirk Eddelbuettel
  • 360,940
  • 56
  • 644
  • 725
3

Ok, I wrote some wrappers as per @DirkEddelbuettel's suggestion. Comments appreciated.

#' Write a table via RSQLite with factors stored in another table
#' Handles data.tables efficiently for large datasets
#' @param conn The connection object (created with e.g. dbConnect)
#' @param name The name of the table to write
#' @param value The data.frame to write to the database
#' @param factorName The base name of the tables to store the factor labels in in the SQLite database (e.g. if factorName is "_factor_" and the data.frame in value contains a factor column called "color" and the name is "mytable" then dbWriteFactorTable will create a table called mytable_factor_color which will store the levels information)
#' @param \dots Options to pass along to dbWriteTable (e.g. append=TRUE)
#' @return A boolean indicating whether the table write was successful
dbWriteFactorTable <- function( conn, name, value, factorName="_factor_", ... ) {
  # Test inputs
  stopifnot(class(conn)=="SQLiteConnection")
  stopifnot(class(name)=="character")
  stopifnot("data.frame" %in% class(value))
  stopifnot(class(factorName)=="character")
  if( grepl("[.]",factorName) ) stop("factorName must use valid characters for SQLite")
  if( "data.table" %in% class(value) ) dt <- TRUE # Is value a data.table, if so use more efficient methods
  # Convert factors to character
  factorCols <- names( Filter( function(x) x=="factor", vapply( value, class, "" ) ) )
  if(length(factorCols>0)) {
    for( cl in which( colnames(value) %in% factorCols ) ) {
      cn <- colnames(value)[cl]
      factorTable <- data.frame( levels=levels(value[[ cn ]]) )
      factorTable$levelKey <- seq(nrow(factorTable))
      fctNm <- paste0(name,factorName,cn)
      dbWriteTable( conn = conn, name = fctNm, value = factorTable, row.names=FALSE, overwrite=TRUE )
      if( dt )  set( x=value, j=cl, value=as.character(value[[ cn ]]) )
    }
    if( !dt )  value <- japply( value, which( colnames(value) %in% factorCols ), as.character )
  } else {
    warning("No factor columns detected.")
  }
  dbWriteTable( conn = conn, name = name, value = value, ... )
}

#' Read a table via RSQLite with factors stored in another table
#' @param conn The connection object (created with e.g. dbConnect)
#' @param name The name of the table to read
#' @param query A character string containing sequel statements to be appended onto the query (e.g. "WHERE x==3")
#' @param dt Whether to return a data.table vs. a plain-old data.frame
#' @param factorName The base name of the tables to store the factor labels in in the SQLite database (e.g. if factorName is "_factor_" and the data.frame in value contains a factor column called "color" and the name is "mytable" then dbWriteFactorTable will expect there to be a table called mytable_factor_color which holds the levels information)
#' @param \dots Options to pass along to dbGetQuery
#' @return A data.table or data.frame
dbReadFactorTable <- function( conn, name, query="", dt=TRUE, factorName="_factor_", ... ) {
  # Test inputs
  stopifnot(class(conn)=="SQLiteConnection")
  stopifnot(class(name)=="character")
  stopifnot(class(factorName)=="character")
  if( grepl("[.]",factorName) ) stop("factorName must use valid characters for SQLite")
  # Read main table
  if( dt ) {
    value <- as.data.table( dbGetQuery( conn, paste("SELECT * FROM",name,query), ... ) )
  } else {
    value <- dbGetQuery( conn, paste("SELECT * FROM",name,query), ... )
  }
  # Convert factors to character
  factorCols <- sub( paste0("^.*",name,factorName,"(.+)$"), "\\1", 
    Filter( Negate(is.na), 
      str_extract( dbListTables( conn ), paste0(".*",name,factorName,".*") ) 
    )
  )
  if(length(factorCols>0)) {
    for( cn in factorCols ) {
      fctNm <- paste0(name,factorName,cn)
      factorTable <- dbGetQuery( conn, paste0("SELECT * FROM ",fctNm) )
      if( dt ) {
        cl <- which( colnames(value) %in% cn )
        set( x=value, j=cl, value=factor( value[[ cn ]], levels=factorTable$levels ) )
      } else {
        value[[ cn ]] <- factor( value[[ cn ]], levels=factorTable$levels )
      }
    }
  } else {
    warning("No factor columns detected.")
  }
  value
}

And a quick example:

db <- dbConnect( SQLite(), dbname="~/temp/test.sqlite" )
set.seed(1)
n <- 1000
testDat <- data.frame(key=seq(n), x=runif(n),y=runif(n),g1=sample(letters[1:10],n,replace=TRUE),g2=rep(letters[1:10],each=n/10),g3=factor( sample(letters[1:10],n,replace=TRUE) ))
if(dbExistsTable(db,"test")) dbRemoveTable(db,"test")
dbWriteFactorTable( conn = db, name = "test", value = as.data.table(testDat), row.names=FALSE )
dbReadFactorTable( conn = db, name = "test" )
dbReadFactorTable( conn = db, name = "test", query="WHERE g3=='a'" )
Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235