I have a data frame (3 columns, 12146637 rows) called tr.sql, which occupies 184 MB. (It is backed by SQL; it holds the contents of my dataset, which I read in via read.csv.sql.)
Column 2 is tr.sql$visit_date. SQL cannot natively represent dates as an R Date object, and this matters for how I need to process the data.
Hence I want to copy the contents of tr.sql to a new data frame tr, where the visit_date column can be natively represented as Date (chron::Date?). Trust me, this makes exploratory data analysis easier, and for now this is how I want to do it. I might use native SQL eventually, but please don't quibble about that for now.
Here is my solution (thanks to gsk and everyone) + workaround:
tr <- data.frame(customer_id=integer(N), visit_date=integer(N), visit_spend=numeric(N))
# fix up col2's class to be Date
class(tr[,2]) <- 'Date'
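For illustration, here is a minimal sketch of pre-allocating the Date column directly instead of re-classing an integer column afterwards. As far as I know, base R has no Date(N) constructor; Date is an S3 class on top of a numeric vector, so the sketch builds it with structure(). The name tr_demo and the tiny N are hypothetical stand-ins for the real objects.

```r
# Tiny N for illustration only; the real dataset has 12146637 rows.
N <- 5L

# A Date vector is a numeric vector (days since 1970-01-01) with class "Date",
# so N missing dates can be pre-allocated without any string temporaries.
empty_dates <- structure(rep(NA_real_, N), class = "Date")

# tr_demo is a hypothetical name standing in for tr.
tr_demo <- data.frame(customer_id = integer(N),
                      visit_date  = empty_dates,
                      visit_spend = numeric(N))

str(tr_demo)
```

The visit_date column is then already of class Date, so later chunked assignments into it keep that class.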
Then the workaround: copy tr.sql -> tr in chunks of (say) N/8 using a for-loop, so that the temporary involved in the str->Date conversion does not cause an out-of-memory error, with a garbage-collect after each chunk:
for (i in 0:7) {
  from <- floor(i*N/8)
  to <- floor((i+1)*N/8) - 1
  if (i==7) to <- N
  print(c("Copying tr.sql$visit_date",from,to," ..."))
  tr$visit_date[from:to] <- as.Date(tr.sql$visit_date[from:to])
  gc()
}
rm(tr.sql)
memsize_gc() ... # only 321 Mb in the end! (was ~1Gb during copying)
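The same chunking idea can be sketched with explicit 1-based, non-overlapping boundaries, which avoids relying on R silently dropping a 0 index in the first chunk. This is only a sketch on synthetic data: src, dst, and n_chunks are hypothetical names, with src standing in for tr.sql$visit_date.

```r
# Synthetic stand-in for tr.sql$visit_date (hypothetical data).
src <- c("2010-04-01", "2010-04-06", "2010-04-07", "2010-04-12", "2010-04-14")
N <- length(src)
n_chunks <- 2L   # the real workaround above uses 8

# Pre-allocated Date vector to receive the converted values.
dst <- structure(rep(NA_real_, N), class = "Date")

# 1-based chunk boundaries: each chunk starts right after the previous one ends.
starts <- floor((0:(n_chunks - 1)) * N / n_chunks) + 1
ends   <- floor((1:n_chunks) * N / n_chunks)

for (k in seq_len(n_chunks)) {
  idx <- starts[k]:ends[k]
  # Only one chunk's worth of strings is converted at a time.
  dst[idx] <- as.Date(src[idx])
  gc()
}

print(dst)
```

Only the temporaries for one chunk are alive at any time, which is the point of the workaround; the chunk count trades peak memory against loop overhead.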
The problem is allocating and then copying the visit_date column. Here are the dataset and code. I am having multiple separate problems with this; explanation below:
'training.csv' looks like...
customer_id,visit_date,visit_spend
2,2010-04-01,5.97
2,2010-04-06,12.71
2,2010-04-07,34.52
and code:
# Read in as SQL (for memory-efficiency)...
library(sqldf)
tr.sql <- read.csv.sql('training.csv')
gc()
memory.size()
# Count of how many rows we are about to declare
N <- nrow(tr.sql)
# Declare a new empty data-frame with same columns as the source d.f.
# Attempt to declare N Date objects (fails due to bad qualified name for Date)
# ... does this allocate N objects the same as data.frame(colname = numeric(N)) ?
tr <- data.frame(visit_date = Date(N))
tr <- tr.sql[0,]
# Attempt to assign the column - fails
tr$visit_date <- as.Date(tr.sql$visit_date)
# Attempt to append (fails)
> tr$visit_date <- append(tr$visit_date, as.Date(tr.sql$visit_date))
Error in `$<-.data.frame`(`*tmp*`, "visit_date", value = c("14700", "14705", :
replacement has 12146637 rows, data has 0
- The second line, which tries to declare data.frame(visit_date = Date(N)), fails. I don't know the correct namespace-qualified name for the Date object (I tried chron::Date and Dates::Date; neither works).
- Both the attempt to assign and the attempt to append fail. I'm not even sure whether it is legal, or efficient, to use append on a single large column of a data frame.
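To make the assignment failure concrete, here is a small sketch on synthetic data. It assumes the cause is simply that tr.sql[0,] has zero rows, so a column of N values cannot be assigned into it, whereas replacing a column in a data frame that already has N rows is accepted. The names src and empty are hypothetical.

```r
# Synthetic three-row stand-in for tr.sql (values from the sample CSV above).
src <- data.frame(customer_id = c(2L, 2L, 2L),
                  visit_date  = c("2010-04-01", "2010-04-06", "2010-04-07"),
                  visit_spend = c(5.97, 12.71, 34.52),
                  stringsAsFactors = FALSE)

# Zero-row copy, as produced by tr.sql[0,].
empty <- src[0, ]

# Assigning 3 values into a 0-row data frame raises
# "replacement has 3 rows, data has 0"; capture the outcome instead of stopping.
res <- tryCatch({
  empty$visit_date <- as.Date(src$visit_date)
  "ok"
}, error = function(e) "error")
print(res)

# Replacing the column where the row count already matches works.
src$visit_date <- as.Date(src$visit_date)
str(src)
```

Whether the in-place replacement fits in memory at full size is a separate question, which is why the chunked copy into a pre-allocated column may still be needed.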
Remember these objects are big, so avoid using temporaries. Thanks in advance...