-2

I have a data-frame (3 cols, 12146637 rows) called tr.sql which occupies 184Mb. (it's backed by SQL, it is the contents of my dataset which I read in via read.csv.sql)

Column 2 is tr.sql$visit_date. SQL does not allow natively representing dates as an R Date object, this is important for how I need to process the data.

Hence I want to copy the contents of tr.sql to a new data-frame tr (where the visit_date column can be natively represented as Date (chron::Date?). Trust me, this makes exploratory data analysis easier, for now this is how I want to do it - I might use native SQL eventually but please don't quibble that for now.)


Here is my solution (thanks to gsk and everyone) + workaround:

tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'

then workaround copying tr.sql -> tr in chunks of (say) N/8 using a for-loop, so that the temporary involved in the str->Date conversion does not out-of-memory, and a garbage-collect after each:

for (i in 0:7) { from <- floor(i*N/8) to <- floor((i+1)*N/8) -1 if (i==7) to <- N print(c("Copying tr.sql$visit_date",from,to," ...")) tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to]) gc() } rm(tr.sql) memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)


The problem is allocating then copying the visit_date column. Here is the dataset and code, I am having multiple separate problems with this, explanation below:

'training.csv' looks like...
customer_id,visit_date,visit_spend 
2,2010-04-01,5.97 
2,2010-04-06,12.71 
2,2010-04-07,34.52 

and code:

# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()

# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705",  : 
  replacement has 12146637 rows, data has 0
  1. The second line that tries to declare data.frame(visit_date = Date(N)) fails, I don't know the correct qualified name with namespace for Date object (tried chron::Date , Dates::Date? don't work)
  2. Both the attempt to assign and append fail. Not even sure whether it is legal, or efficient, to use append on a single large column of a data-frame.

Remember these objects are big, so avoid using temporaries. Thanks in advance...

smci
  • 32,567
  • 20
  • 113
  • 146

3 Answers3

2

Try this ensuring that you are using the most recent version of sqldf (currently version 0.4-1.2).

(If you find you are running out of memory try putting the database on disk by adding the dbname = tempfile() argument to the read.csv.sql call. If even that fails then its so large in relation to available memory that its unlikely you are going to be able to do much analysis with it anyways.)

# create test data file
Lines <- 
"customer_id,visit_date,visit_spend 
2,2010-04-01,5.97 
2,2010-04-06,12.71 
2,2010-04-07,34.52"
cat(Lines, file = "trainingtest.csv")

# read it back
library(sqldf)
DF <- read.csv.sql("trainingtest.csv", method = c("integer", "Date2", "numeric"))
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
  • `installed.packages()` only says `sqldf "GPL-2" "2.13.1"` ? – smci Aug 06 '11 at 23:27
  • The problem is not the _read.csv.sql()_ call (which succeeds), it is the subsequent assignment involving str temporaries. See my workaround above for copying by chunks. Putting the database on disk will not fix that. – smci Aug 06 '11 at 23:35
  • `packageVersion("sqldf")` will show you your version. If its not the latest try `update.packages()` or just run the update anyways just to be sure. – G. Grothendieck Aug 06 '11 at 23:35
  • Thanks, yes my version is 0.4-1.2. – smci Aug 06 '11 at 23:37
  • 1
    You don't need that code to set the Date class since the `read.csv.sql` call in my post already does that. Also I think you are now changing the question. The question was originally how to read the data and set the classes and since you indicate that that works your question has been answered. If you have other questions post a new question if its something that is worthwhile for this forum. – G. Grothendieck Aug 06 '11 at 23:43
  • I had originally tried exactly that before I posted here: a `read.csv.sql` with colClasses. However I kept getting the date field stored as factor, which was not what I wanted. This has been part 2 of my question all along. – smci Aug 06 '11 at 23:48
  • `colClasses` is for `read.table` but not for `read.csv.sql` so one would not expect that to work. With `read.csv.sql` you must use `method`, as shown. – G. Grothendieck Aug 07 '11 at 00:11
  • What is Date2? It doesn't show in my list of packages or objects? – smci Aug 07 '11 at 01:21
  • `Date2` is documented under the `method` argument in `?sqldf` . – G. Grothendieck Aug 07 '11 at 01:26
0

It doesn't look to me like you've got a data.frame there (N is a vector of length 1). Should be simple:

tr <- tr.sql
tr$visit_date <- as.Date(tr.sql$visit_date)

Or even more efficient:

tr <- data.frame(colOne = tr.sql[,1], visit_date = as.Date(tr.sql$visit_date), colThree = tr.sql[,3])

As a side note, your title says "append" but I don't think that's the operation you want. You're making the data.frame wider, not appending them on to the end (making it longer). Conceptually, this is a cbind() operation.

Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
  • tr _is_ a data-frame. N is only the count of rows. – smci Aug 06 '11 at 12:33
  • N is of length 1. Therefore `nrow(tr)==1`. Then you reduce it down to 0 rows with the next line. Therefore it's not surprising that you get `replacement has 12146637 rows, data has 0` when you try to do anything. If you really want to do an append operation, use `rbind()`. – Ari B. Friedman Aug 06 '11 at 12:35
  • It's very failure-prone to try to assign the entire d.f. at once `tr <- tr.sql` ; this blows out memory. Each SQL date takes 72 bytes, and they get converted to strings, taking ~600 bytes each. This tries to allocate >1Gb and fails. – smci Aug 06 '11 at 12:37
  • No, N is just a number. tr is the data-frame. You're confusing these two. I only declared N so that people would understand `data.frame(visit_date = Date(N))` is a very big allocation. – smci Aug 06 '11 at 12:40
  • Please run this line `tr <- data.frame(visit_date = Date(N))` and then run `dim(tr)` and paste the results here. Also, have you tried my solution? Does it work? – Ari B. Friedman Aug 06 '11 at 12:44
  • Can you tell me a) the fully qualified name for Date(N?) and b) the successful cbind/rbind syntax? (Yes I read the manual.) – smci Aug 06 '11 at 12:44
  • ^^ I said data.frame(visit_date = Date(N)) fails due to needing the fully-qualified name for Date! – smci Aug 06 '11 at 12:45
  • Also, are you sure you want it as a Date not POSIXlt? Date is a base package functionality. There's no fully-qualified name needed. But Date() is not a function. Try as.Date (if stored as numeric) or use as.POSIXlt . – Ari B. Friedman Aug 06 '11 at 12:46
  • @gsk3 let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2189/discussion-between-smci-and-gsk3) – smci Aug 06 '11 at 12:47
  • 2
    It would be helpful if you make this reproducible or at least (1) display the first few lines of your file and (2) show the statement you used to read it in. This will show the first 3 lines: `for(ll in readLines("visit.csv", n = 3)) cat(ll, "\n")` – G. Grothendieck Aug 06 '11 at 15:13
  • Your second line of code could be much more clearly expressed using `transform`. – hadley Aug 06 '11 at 21:41
  • @hadley: no you can't use `transform()` as is, the SQL object cannot natively support Date, so the string blows up memory usage. – smci Aug 06 '11 at 21:56
  • @G. Grothendieck, the **original post was already 100% reproducible**. I posted you three lines from the .csv: they are indeed an int, a date and a numeric. And I had already said the read command was _read.csv.sql_ – smci Aug 06 '11 at 22:05
  • @smci Please read http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example . And do try not to be so combative towards those offering you free help.... – Ari B. Friedman Aug 06 '11 at 22:08
  • Thanks a lot gsk3, but the original example was 100% reproducible (could have generate random rows of int, date & numeric, Grothendieck's comment was not constructive. – smci Aug 06 '11 at 22:35
  • 1
    @smci You're missing the point. A good question will be 100% reproducible by the standard that someone should be able to copy and paste and reproduce the entire problem. That means you should make a sample data.frame (as you point out, you could generate random rows). Since you know the problem best, it's best for you to generate that sample data.frame. I can't count the number of times I've generated a sample data.frame in solving a problem only to have it turn out that the actual problem wasn't reproduced in the way I generated the data. You could even pull a row of dates from your SQL DB. – Ari B. Friedman Aug 06 '11 at 22:39
  • ^^ I disagree that it needs to be cut-and-paste-reproducible (that's not the standard for say Python questions on SO), but if that's the standard here among R users I'll follow it. Thanks for the URL. – smci Aug 06 '11 at 23:19
0

Try this:

tr <- data.frame(visit_date= as.Date(tr.sql$visit_date, origin="1970-01-01") )

This will succeed if your format is YYYY-MM-DD or YYYY/MM/DD. If not one of those formats then post more details. It will also succeed if tr.sql$visit_date is a numeric vector equal to the number of days after the origin. E.g:

vdfrm <- data.frame(a = as.Date(c(1470, 1475, 1480), origin="1970-01-01")  )
vdfrm
           a
1 1974-01-10
2 1974-01-15
3 1974-01-20
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Sorry but this fails (memory allocation) for the reason I stated: there is a huge temporary involved in the conversion from str `as.Date(tr.sql$visit_date, ...)` – smci Aug 06 '11 at 23:21
  • The principles could be used to do in in chunks but needing to do that is a last resort. 184 MB is not that big. Even 1GB is medium-sized potatoes if you have sufficient machine resources. (Windows XP might be a problem, admittedly, given its default 2.5 GB limit.) Tips about settings might be possible if you included anything about your OS and sessionInfo() – IRTFM Aug 07 '11 at 01:59
  • Windows Vista 32b; i386-pc-mingw32/i386 (32-bit). And [the memory limit is 2Gb](http://cran.r-project.org/bin/windows/rw-FAQ.html#There-seems-to-be-a-limit-on-the-memory-it-uses_0021). – smci Aug 07 '11 at 02:58