
Executive summary: *Improve the memory efficiency of the date-based extraction functions used in the aggregate() calls below, so as not to blow out the 1Gb memory limit.*

I have a large dataset tr stored in a data frame (3 columns, ~12 million rows, ~200Mb).
The columns are customer_id (integer), visit_date and visit_spend (numeric). The full file is here, but it requires registration, so this is as reproducible as it can be.
The dataset looks like:

customer_id,visit_date,visit_spend 
2,2010-04-01,5.97 
2,2010-04-06,12.71 
2,2010-04-07,34.52
#... 12146637 rows in total

The date range is restricted to 2010-04-01 ... 2011-06-30 (14700..15155 as integers).

What I'm asking here is: what is the optimal representation to choose for the visit_date field? I make some aggregate() calls (example code at the bottom) that blow up memory, and I also use date utility fns like the ones attached at the bottom (they will need recoding for compactness, but they are typical of the operations I want to do a lot of). So I need a representation for the date that avoids this.

As I see it there are three possible representations I could use for the visit_date field; here are their pros and cons with respect to what I'm trying to do. My aim is the format which does not blow up memory and gives the least grief during these date-handling operations, aggregation etc. (for concreteness, the conversions I mean are sketched after this list):

  • integer or factor
    Cons:
    1) doesn't allow comparison or sort operations, hence aggregation is painful.
    2) I would need to hardcode all the date-related functions (e.g. 14700..14729 for Apr 2010) - doable but painful.
    3) Needs manual handling for graphs.
  • numeric
    Cons: blows up memory because it requires as.Date() conversions everywhere.
  • Date
    Pros: most readable for print(), graphs and histograms; does not need manual handling treatment before graphing.
    Cons: blows up (out-of-memory fail) if I apply any string operations (format) or aggregate. I think that's base R's Date class rather than chron; it's whatever you get when you set class(tr$visit_date) <- 'Date' or use read.csv(colClasses=c(...,"Date",...)).
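
For concreteness, the conversions I mean are roughly these (a sketch, assuming visit_date comes in as character strings like "2010-04-01"):

tr$visit_date <- as.Date(tr$visit_date)                 # Date class
# tr$visit_date <- as.integer(as.Date(tr$visit_date))   # integer: days since 1970-01-01
# tr$visit_date <- as.numeric(as.Date(tr$visit_date))   # numeric: same day counts, stored as double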

These are the date utility fns I want to run a lot of (but currently they blow up during aggregate):

# Utility fns related to date    
library(chron)    
# Get dayoftheweek as integer 0(Sun)..6(Sat)
dayofweek <- function(ddd) {
    with( month.day.year(ddd), day.of.week(month,day,year) )
    #do.call( "day.of.week", month.day.year(x) )
    # as.numeric(x-3)%%7
}    
weekofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%W"))
}
monthofyear <- function(dt) {
    as.numeric(format(as.Date(dt), "%m"))
}    
# Append a dayoftheweek-as-string field (e.g. "Thu")
append_dotwa_column <- function(x, from_date_fld) {
    x$dotwa <- format(x[[from_date_fld]], "%a")
    x
}
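
Typical usage is something like this (a sketch; the dow/woy/dotwa column names are just for illustration, and dayofweek() is fed the numeric day count that chron expects):

tr$woy <- weekofyear(tr$visit_date)              # week number within the year
tr$dow <- dayofweek(as.numeric(tr$visit_date))   # 0 (Sun) .. 6 (Sat)
tr     <- append_dotwa_column(tr, "visit_date")  # adds tr$dotwa, e.g. "Thu"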

and here are a couple of the aggregate() calls that fail out-of-memory:

agg_dly_spend  <- aggregate(tr$visit_spend, 
        list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend, 
        list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")

(How much memory should those aggregate() calls take? Correct me if I'm wrong, but the mixed column types make it hard to use bigmemory. So I may have to go to SQL, but that would be a big loss - I'd lose R's really nice subsetting-by-date: tr[tr$visit_date > "2010-09-01",].)

(Platform is R 2.13.1 on Windows Vista 32-bit, so there is a 2Gb overall process limit, which means any data frame should not exceed ~600-900Mb.)

smci

1 Answer


EDIT: The code I originally copied in was not the final version of these functions, so it contained bugs. They are now fixed.

I do not completely agree with the votes to close, but your question does need some careful reading. As I understand it, the problem is the representation of the date. Numeric is just a crazy idea; use integer in that case. As an overview of the different formats and their relative sizes (using the lsos function from this question):
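
(If you don't want to dig up that answer: a minimal sketch of such a helper, built on object.size(); the original lsos is more elaborate and also reports rows/columns.)

lsos <- function(envir = .GlobalEnv) {
  objs   <- ls(envir = envir)
  sizes  <- sapply(objs, function(nm) as.numeric(object.size(get(nm, envir = envir))))
  types  <- sapply(objs, function(nm) class(get(nm, envir = envir))[1])
  pretty <- ifelse(sizes >= 1024^2, sprintf("%.1f Mb", sizes/1024^2),
                   sprintf("%.1f Kb", sizes/1024))
  out <- data.frame(Type = types, Size = sizes, PrettySize = pretty, row.names = objs)
  out[order(out$Size, decreasing = TRUE), ]   # biggest objects first
}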

library(chron)   # for as.chron()

Dates <- rep(seq.Date(as.Date("2010-04-01"),
             as.Date("2011-04-01"),by='day'),each=100)

charDates <- as.character(Dates)
POSIXctDates <- as.POSIXct(Dates)
POSIXltDates <- as.POSIXlt(Dates)
integerDates <- as.integer(Dates)
chronDates <- as.chron(Dates)
numericDates <- as.numeric(Dates)

 > lsos()
                  Type    Size PrettySize  Rows Columns
POSIXltDates   POSIXlt 1464976     1.4 Mb 36600      NA
chronDates       dates  293400   286.5 Kb 36600      NA
POSIXctDates   POSIXct  292976   286.1 Kb 36600      NA
Dates             Date  292944   286.1 Kb 36600      NA
numericDates   numeric  292824     286 Kb 36600      NA
charDates    character  161064   157.3 Kb 36600      NA
integerDates   integer  146424     143 Kb 36600      NA

The internal Date representation can compete pretty well with the numeric one. character causes trouble with all the rest of the functionality, so forget about that one. I would just use Date; that'll do and it keeps the functionality intact. But pay attention to the size of POSIXlt: all the functions for extracting months, weeks, day of the year etc. go via that format. That's true for format(), and for weekdays(), months(), ... in both base R and the chron package.
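
To see why POSIXlt is roughly five times the size (a quick sketch): it stores every date as a list of nine components instead of a single number:

lt <- as.POSIXlt(as.Date("2010-04-07"))
unlist(lt)
#   sec   min  hour  mday   mon  year  wday  yday isdst
#     0     0     0     7     3   110     3    96     0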

Some remarks :

  • Vista has a theoretical 3Gb limit per process on a machine with 4Gb. Did you try memory.limit(3000)? See ?memory.limit (and the sketch after this list).
  • Please update asap to Windows 7 if you care at all about decent memory management. Vista just sucks, big time. If possible, get a 64-bit machine if you work with data like this.
  • Clean up your workspace once in a while. I can run your code without problems in 32-bit R with a 2Gb memory limit on the data frame given below (14+ million rows).
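
(A quick sketch of the housekeeping from those last two points, for a 32-bit Windows session:)

memory.limit()                      # current limit in Mb (Windows only)
memory.limit(3000)                  # ask for 3Gb if the OS can provide it
rm(agg_dly_spend, agg_wkly_spend)   # drop intermediate results you no longer need
gc()                                # and garbage-collect what they used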

On to your code. I work with the Date format, which is about the same size as the numeric format. Let's try it with the following data (which you could have provided...), giving 14.6 million rows. I run Windows 7 (64-bit) with 4Gb memory in total.

n <- 40000
tr <- data.frame(
  customer_id = rep(1:366,n),
  visit_date = rep(seq.Date(as.Date("2010-04-01"),
                     as.Date("2011-04-01"),by='day'),each=n),
  visit_spend =runif(366*n,3,12)
 )

First your weekofyear function. As said, format() uses the underlying POSIXlt format, which is, as shown, memory-intensive. Still, you can cut out about half of the memory by accessing that representation directly (see ?POSIXlt). These versions return integers, which take about half the memory of the numerics your functions return:

dayofweek2 <- function(dt) {
  as.POSIXlt(dt)$wday
}

monthofyear2 <- function(dt){
  as.POSIXlt(dt)$mon +1L
}

weekofyear2 <- function(dt) {
  # assumes dt is ordered by date, as in the example data
  n <- length(dt)
  lt <- as.POSIXlt(dt)
  # positions just before the first element of each new year
  yearstart <- c(0,which(diff(lt$year)!=0))
  # weekday of the first date in each year, repeated over that year's block
  fday <- rep(lt[yearstart+1L]$wday,
              diff(c(yearstart,n)+1L)
          )
  # week number from the 0-based day of the year
  as.integer((lt$yday+fday-1L)%/%7)
}
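
(The x1 and x2 in the listing below come from running both versions on the Dates vector from above, a sketch:)

x1 <- weekofyear(Dates)    # your version: numeric, via format()/POSIXlt
x2 <- weekofyear2(Dates)   # this version: integer, direct POSIXlt field access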

> lsos()
                   Type    Size PrettySize  Rows Columns
tr           data.frame  733128   715.9 Kb 36600       3
Dates              Date  293048   286.2 Kb 36600      NA
x1              numeric  292840     286 Kb 36600      NA # from your function
x2              integer  146440     143 Kb 36600      NA # from my function

If you need even less, you'll have to do the math yourself. But I advise you not to try that, and definitely not based on a character representation: string operations like strsplit() and substr() will blow up your memory for sure, as does the month.day.year() function of the chron package. Stay away from chron with big data. In fact, despite the huge space POSIXlt objects need, going through POSIXlt is still the best option memory-wise for extraction.

On to the aggregate. aggregate() is meant for data frames, and hence the call again makes a lot of copies of the data. Doing the call more manually can save on those copies. A proposal for a function:

my.agg <- function(x,d,AGGFUN=NULL,FUN="sum"){
  FUN <- match.fun(FUN)
  if(!is.null(AGGFUN)) {
    AGGFUN <- match.fun(AGGFUN)
    d <- AGGFUN(d)
  }
  ud <- sort(unique(d))

  xout <- sapply(ud,function(i){
    id <- which(d==i)  # find which values respond to a certain value of d
    FUN(x[id])         # apply the function to only those values
  },USE.NAMES=FALSE)

  data.frame(d=ud,x=xout)    
}
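
It is called like this (a usage sketch; with AGGFUN left NULL the raw dates themselves are the grouping variable):

# weekly totals: dates are first mapped through weekofyear2, then summed per week
agg_wkly_spend2 <- my.agg(tr$visit_spend, tr$visit_date, weekofyear2, sum)
# daily totals: no AGGFUN, so grouping is directly by visit_date
agg_dly_spend2  <- my.agg(tr$visit_spend, tr$visit_date, FUN = sum)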

Now if we apply this and watch the memory usage:

gc(TRUE,reset=TRUE)
x1 <-  aggregate(tr$visit_spend, 
           list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
rm(x1)
gc(TRUE,reset=TRUE)
Sys.sleep(5)
x2 <- my.agg(tr$visit_spend,tr$visit_date,weekofyear2,sum)    
gc(TRUE,reset=TRUE)

I get the following result:

[screenshot: memory-usage trace of both runs - red box = the aggregate() call, yellow box = my.agg()]

The red square is your aggregate() call, the yellow square is my proposal. The first bump in the memory usage of your aggregate() call is the weekofyear function you use. My proposal saves both on the memory usage of weekofyear and on that of the aggregate call, and it runs quite a bit faster too. I never got over 2.6Gb of total memory usage with my proposal.

Hope this helps.

Joris Meys
  • Might I suggest that the use of as.POSIXlt could be made a lot more space efficient with this sort of code: `z <- .Internal(Date2POSIXlt(tr$visit_date))$year`, avoiding adding names to the returned object. – IRTFM Aug 07 '11 at 14:53
  • @DWin : I tried it out, but it doesn't really save on used memory on my computer. It's the call itself that takes up the memory, not the final object. The difference in size between both objects is also negligible. – Joris Meys Aug 07 '11 at 22:00
  • Great stuff, but I wonder if _my.agg()_ might have a functional bug? All 12 row-sums in the aggregate by monthofyear turn out identical. – smci Aug 08 '11 at 09:23
  • Also, how do you correctly invoke _my.agg()_ with no AGGFUN? The daily aggregate results are all 0. Neither `agg_dly_spend2 <- my.agg(tt$visit_date,tt$visit_spend)` nor `my.agg(tr$visit_spend,tr$visit_date,,sum)` seem to work correctly? – smci Aug 08 '11 at 09:25
  • @smci : There's indeed a bug in the my.agg() function. sapply(nd...) should of course be sapply(ud...). I corrected it in the code and it works. – Joris Meys Aug 08 '11 at 09:50
  • @smci : it avoids wasting memory as it does not stack up function calls, like aggregate() does. It's not in one line, it's in the concept of reducing copies of the object. – Joris Meys Aug 08 '11 at 10:46
  • That's brilliant, can you throw in a few more comments (inline) explaining at exactly which points it avoids wasting memory? I don't understand the funky syntax in the `sapply(ud,function(i){...}` part, or why you don't just inline id into `FUN(x[which(d==i)]` ? – smci Aug 08 '11 at 10:50
  • @JorisMeys let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2217/discussion-between-smci-and-joris-meys) – smci Aug 08 '11 at 10:50