I am running out of memory using strsplit (presumably); here is the code:

split.fields <- function (frame, fields, split, suffix, ...) {
  for (field in fields) {
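    # keep only the part before the first separator; `...` (e.g. fixed=TRUE) is passed on to strsplit
    v <- sapply(strsplit(frame[[field]], split, ...), "[", 1)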
    frame[[paste0(field,suffix)]] <- frame[[field]]
    frame[[field]] <- v
  }
  frame
}
split.version <- function (frame, fields)
  split.fields(frame, fields, split="@", suffix="Ver", fixed=TRUE)
> gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 238165 12.8     467875   25   407500 21.8
Vcells 369492  2.9     905753    7   905631  7.0
> frame <- data.frame(browser = sample(c("Chrome@28","Chrome@27","Firefox@21","Firefox@22","IE@9","IE@8"), 1e7, replace=TRUE), stringsAsFactors=FALSE)
> str(frame)
'data.frame':   10000000 obs. of  1 variable:
 $ browser: chr  "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> object.size(frame)
80000992 bytes
> gc()
           used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells   240555 12.9     467875  25.0   407500  21.8
Vcells 10373979 79.2   34109873 260.3 40534688 309.3
> system.time(frame <- split.version(frame,"browser"))
   user  system elapsed 
 73.700   0.872  74.831 
> object.size(frame)
160001248 bytes
> str(frame)
'data.frame':   10000000 obs. of  2 variables:
 $ browser   : chr  "IE" "Chrome" "Chrome" "Chrome" ...
 $ browserVer: chr  "IE@8" "Chrome@27" "Chrome@27" "Chrome@27" ...
> gc()
           used  (Mb) gc trigger  (Mb)  max used   (Mb)
Ncells   264888  14.2   16652260 889.4  31376740 1675.7
Vcells 20459856 156.1   95461025 728.4 119226749  909.7

This all looks more or less reasonable except that the R process's RSS is now 1.6G.

This appears to imply that the 1675.7Mb of Ncells in the max used column have not been returned to the OS.

I don't much care that the OS does not get the RAM back; what I do care about is that R allocated 1.6G to process 80M of data (and on my real data it runs out of the physical RAM available).

Is there a way to make this more memory efficient?

E.g., maybe converting the character vector to a factor and operating on its levels would help?

R version 3.0.1 (2013-05-16) -- "Good Sport"
Platform: x86_64-pc-linux-gnu (64-bit)
sds

2 Answers

How about using substr and regexpr:

x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
substr(x,1,regexpr("@",x)-1)
[1] "Chrome"  "Chrome"  "Firefox" "IE" 
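
As a rough sketch of how this idea might be dropped into the question's helper, the version below swaps strsplit for regexpr and substr. The name split.fields2 and the handling of values that contain no separator are assumptions made for illustration, not part of the original answer, and the memory behaviour on the 1e7-row frame has not been measured here.

```r
# Hypothetical variant of split.fields that avoids strsplit's per-element list.
split.fields2 <- function(frame, fields, split, suffix) {
  for (field in fields) {
    x <- frame[[field]]
    # position of the first literal separator, -1 where it is absent
    pos <- as.integer(regexpr(split, x, fixed = TRUE))
    stop <- pos - 1L
    # assumption: values without a separator are kept whole
    miss <- which(pos < 0L)
    stop[miss] <- nchar(x[miss])
    frame[[paste0(field, suffix)]] <- x   # original value saved under <field><suffix>
    frame[[field]] <- substr(x, 1L, stop)
  }
  frame
}

df <- data.frame(browser = c("Chrome@28", "IE@8", "Opera"),
                 stringsAsFactors = FALSE)
out <- split.fields2(df, "browser", split = "@", suffix = "Ver")
```

Both calls are vectorized and return plain vectors, so no list with one element per row is built along the way.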
James

What @James said, or even simpler:

x <- c("Chrome@28","Chrome@27","Firefox@21","IE@8")
sub('@.*', '', x)
#[1] "Chrome"  "Chrome"  "Firefox" "IE"  
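
The question also asks whether working on factor levels would help. A minimal sketch of that idea, assuming the column has only a handful of distinct values (as in the example data), could look like the following. The variable names are illustrative, and whether this lowers peak memory on the real data is not something tested here.

```r
x <- c("Chrome@28", "Chrome@27", "Firefox@21", "IE@8", "Chrome@28")
f <- factor(x)                          # integer codes plus a small set of levels
stripped <- sub("@.*", "", levels(f))   # the regex runs once per distinct value
res <- stripped[as.integer(f)]          # map the codes back to a full-length vector
```

Here the pattern matching touches only the distinct levels rather than every row, although building the factor still needs one pass over the full vector.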
eddi