I am running out of memory, presumably because of strsplit; here is the code:
split.fields <- function (frame, fields, split, suffix, ...) {
    for (field in fields) {
        ## keep only the part before the separator
        v <- sapply(strsplit(frame[[field]], split, ...), "[", 1)
        ## preserve the original value under a new name
        frame[[paste0(field, suffix)]] <- frame[[field]]
        frame[[field]] <- v
    }
    frame
}
split.version <- function (frame, fields)
    split.fields(frame, fields, split="@", suffix="Ver", fixed=TRUE)
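On a toy frame, the intended result looks like this:

> split.version(data.frame(browser="IE@8", stringsAsFactors=FALSE), "browser")
  browser browserVer
1      IE       IE@8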
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  238165 12.8     467875 25.0   407500 21.8
Vcells  369492  2.9     905753  7.0   905631  7.0
> frame <- data.frame(browser = sample(c("Chrome@28","Chrome@27","Firefox@21","Firefox@22","IE@9","IE@8"), 1e7, replace=TRUE), stringsAsFactors=FALSE)
> str(frame)
'data.frame': 10000000 obs. of 1 variable:
$ browser: chr "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> object.size(frame)
80000992 bytes
> gc()
            used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells    240555 12.9     467875  25.0   407500  21.8
Vcells  10373979 79.2   34109873 260.3 40534688 309.3
> system.time(frame <- split.version(frame,"browser"))
   user  system elapsed
 73.700   0.872  74.831
> object.size(frame)
160001248 bytes
> str(frame)
'data.frame': 10000000 obs. of 2 variables:
$ browser : chr "IE" "Chrome" "Chrome" "Chrome" ...
$ browserVer: chr "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> gc()
            used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells    264888  14.2   16652260 889.4  31376740 1675.7
Vcells  20459856 156.1   95461025 728.4 119226749  909.7
This all looks more or less reasonable, except that the R process's RSS is now 1.6 GB. That appears to imply that the 1675.7 Mb of Ncells in the "max used" column have not been returned to the OS.
I don't care much about the OS not getting the RAM back; what I do care about is that R allocated 1.6 GB to process 80 MB of data (and on my real data it runs out of the available physical RAM).
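My guess (unverified) is that the intermediate list strsplit builds accounts for most of the allocation: 1e7 length-2 character vectors, each a separate SEXP. Something like this should show the size of that list alone (using the frame from the transcript above):

s <- strsplit(frame$browserVer, "@", fixed=TRUE)  # the intermediate list alone
print(object.size(s), units="Mb")
rm(s); gc()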
Is there a way to make this more memory-efficient?
E.g., maybe converting the character vector to a factor and operating on its levels would help (rough sketch below)?
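Here is a rough, unprofiled sketch of that factor idea (split.version.factor is a name I am making up; it mirrors split.version above and assumes the separator appears at most once per value):

## Split only the unique levels, then map the prefixes back.
split.version.factor <- function (frame, field, suffix="Ver") {
    f   <- factor(frame[[field]])                # ~6 levels instead of 1e7 strings
    pre <- sapply(strsplit(levels(f), "@", fixed=TRUE), "[", 1)
    frame[[paste0(field, suffix)]] <- frame[[field]]  # keep the original, as before
    frame[[field]] <- pre[as.integer(f)]         # expand prefixes back to full length
    frame
}

On the sample data this would call strsplit on only the six level strings, so the big intermediate list never exists; I have not verified how much it actually helps the RSS.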
R version 3.0.1 (2013-05-16) -- "Good Sport"
Platform: x86_64-pc-linux-gnu (64-bit)