Executive summary: *Improve the memory efficiency of the date-based extraction functions used in the aggregate() calls below, so that they do not blow out the ~1Gb memory limit.*
I have a large dataset tr stored in a data frame (3 columns, 12 million rows; ~200Mb).
The columns are customer_id (integer), visit_date, and visit_spend (numeric). The full file is here; it requires registration, so this is as reproducible as it can be. The dataset looks like:
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
#... 12146637 rows in total
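For completeness, this is roughly how I read it in (the filename here is just a placeholder):

tr <- read.csv("transactions.csv",
               colClasses = c("integer", "Date", "numeric"))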
The date range is restricted to 2010-04-01 ... 2011-06-30 (14700..15155 as integer days since 1970-01-01).
Here I'm asking what the optimal representation is for the visit_date field. I do some aggregate() calls (example code at the bottom) that blow up memory, and I also use date utility functions like those attached at the bottom (they will need recoding for compactness, but these are the typical operations I want to do a lot of). So I need a representation for the date that avoids this.

As I see it there are three possible representations for the visit_date field; here are their pros and cons with respect to what I am trying to do. My aim is to pick the format that does not blow up memory and gives the least grief during these date-handling operations, aggregate() calls etc. (a rough size check follows the list):
- integer or factor

  Cons:
  1) doesn't allow comparison or sort operations, hence aggregation is painful.
  2) I would need to hardcode all the date-related functions (e.g. 14700..14729 for Apr 2010) - doable but painful.
  3) needs manual handling for graphs.

- numeric

  Cons: blows up memory due to requiring as.Date() everywhere.

- Date

  Pros: most readable for print(), graphs and histograms; does not need manual treatment before graphing.

  Cons: blows up (out-of-memory fail) if I apply any string operations (format) or aggregate. I think that's the plain Date class, i.e. whatever you get from class(tr$visit_date) <- 'Date' or read.csv(colClasses=c(..., "Date", ...)).
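For reference, a rough size check of the three candidates (a sketch only - the vector here is simulated with sample(), not the real data):

n <- 12146637
d_int  <- sample(14700:15155, n, replace = TRUE)   # integer day numbers
d_num  <- as.numeric(d_int)                        # numeric (double)
d_date <- as.Date(d_int, origin = "1970-01-01")    # Date class
object.size(d_int)    # ~4 bytes per row, roughly 46Mb
object.size(d_num)    # ~8 bytes per row, roughly 93Mb
object.size(d_date)   # also roughly 93Mb - Date is stored as a double underneath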
These are the date utility fns I want to run a lot of (but currently they blow up during aggregate):
# Utility fns related to date
library(chron)

# Get day-of-the-week as integer 0 (Sun) .. 6 (Sat)
dayofweek <- function(ddd) {
  with(month.day.year(ddd), day.of.week(month, day, year))
  # alternatives:
  # do.call("day.of.week", month.day.year(ddd))
  # as.numeric(ddd - 3) %% 7
}

# Get week-of-year as integer 00..53 (weeks starting on Monday)
weekofyear <- function(dt) {
  as.numeric(format(as.Date(dt), "%W"))
}

# Get month-of-year as integer 1..12
monthofyear <- function(dt) {
  as.numeric(format(as.Date(dt), "%m"))
}

# Append a day-of-the-week-as-string column (e.g. "Mon");
# from_date_fld is the name of the date column, passed as a string
append_dotwa_column <- function(x, from_date_fld) {
  x$dotwa <- format(x[[from_date_fld]], "%a")
  x   # return the modified data frame; the caller must reassign
}
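One recoding-for-compactness idea (an untested sketch; the helper name weekofyear_compact is mine): since there are only ~456 distinct dates in the range, do the as.Date()/format() work once per unique date and map the results back with match(), instead of over all 12M rows:

weekofyear_compact <- function(dt) {
  ud <- unique(dt)                                      # a few hundred distinct dates
  as.numeric(format(as.Date(ud), "%W"))[match(dt, ud)]  # compute once, then look up
}

The same pattern would apply to dayofweek, monthofyear and append_dotwa_column.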
and here are a couple of the aggregate() calls that fail out-of-memory:
agg_dly_spend <- aggregate(tr$visit_spend,
list('visit_date'=tr$visit_date), FUN="sum")
agg_wkly_spend <- aggregate(tr$visit_spend,
list('weekofyear'=weekofyear(tr$visit_date)), FUN="sum")
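Would something like tapply() on the underlying integer day numbers behave better? e.g. (untested sketch, assuming that avoids some of aggregate()'s intermediate copies):

agg_dly_spend  <- tapply(tr$visit_spend, as.integer(tr$visit_date), sum)
agg_wkly_spend <- tapply(tr$visit_spend, weekofyear_compact(tr$visit_date), sum)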
(How much memory should those aggregate() calls take? Correct me if I'm wrong, but the mixed column types make it hard to use bigmemory. So I may have to go to SQL, but that would be a big loss - I'd lose R's really nice subsetting-by-date: tr[tr$visit_date > "2010-09-01",].)
(Platform is R 2.13.1 on 32-bit Windows Vista, so there is a 2Gb overall process limit, which means any single data frame should not exceed ~600-900Mb.)