
I want to use apply instead of a for loop to speed up a function that builds a character vector by paste-collapsing each row of a data frame. The data frame contains strings and numbers with many decimals.
The speed-up is notable, but apply pads the numbers on the left with spaces so that all values have the same number of characters, and rounds them to integers, whereas the for loop does not.
I was able to work around this by applying as.character to the numbers, but the data frame then uses much more memory, and I still don't know why apply does this. Does anyone have an explanation or a better solution?

Using apply:

df <- data.frame(V1=rep(letters[1:20], 1000/20), V2=(1:1000)+0.00000001,
                 V3=rep(letters[1:20], 1000/20), stringsAsFactors=FALSE)

system.time(varapl <- apply(df, 1, function(x){
                paste(x[1:3], collapse="_")
                }))
varapl[c(1,10,100,1000)]

Output:

  user  system elapsed 
  0.01    0.00    0.02 

[1] "a_   1_a" "j_  10_j" "t_ 100_t" "t_1000_t"
# Padded with spaces on the left, and rounded!

Using for:

varfor <- NULL
system.time(for(i in 1:1000){
  varfor <- c(varfor, paste(df[i,1:3], collapse="_"))
})
varfor[c(1,10,100,1000)]

Output:

   user  system elapsed 
   0.19    0.00    0.19 

[1] "a_1.00000001_a"    "j_10.00000001_j"   "t_100.00000001_t"  "t_1000.00000001_t"
# This is what I'm looking for!

The workaround was:

df2 <- data.frame(V1=rep(letters[1:20], 1000/20),
                  V2=as.character((1:1000)+0.00000001),
                  V3=rep(letters[1:20], 1000/20), stringsAsFactors=FALSE)

varapl <- apply(df2, 1, function(x){
                paste(x[1:3], collapse="_")
                })
varapl[c(1,10,100,1000)]

[1] "a_1.00000001_a"   "j_10.00000001_j"  "t_100.00000001_t"  "t_1000.00000001_t"

However:

object.size(df)
26816 bytes
object.size(df2)
97208 bytes

My original data frames have millions of entries, so both speed and memory constraints are important.

Thank you in advance for your comments! Keo.

Keo
  • See, also, `do.call(paste, c(df, sep = "_"))` – alexis_laz Feb 08 '15 at 18:08
  • I'm not sure I understand exactly what you want, but is the apply really needed? It seems that I can get what you want with `paste(df[,1],format(df[,2],trim=T,digits=10),df[,3],sep='_')` – xraynaud Feb 08 '15 at 18:10
  • @alexis_laz Your solution is very fast on 1e7 dataset. `user system elapsed 2.009 0.001 2.008` – akrun Feb 08 '15 at 18:33
  • Yes, the good old `do.call(paste, c(df, sep = "_"))` is even faster than `unite`... – David Arenburg Feb 08 '15 at 18:44
  • See [this](http://stackoverflow.com/questions/15618527/why-does-as-matrix-add-extra-spaces-when-converting-numeric-to-character) for one part of the question (`apply` calls `as.matrix`) and [this](http://stackoverflow.com/questions/21682462/concatenate-columns-and-add-them-to-beginning-of-data-frame) (out of many similar QAs) for other part of the question. – alexis_laz Feb 08 '15 at 19:09
  • Almost never is this necessary. You should explain what your actual task is. This way of copying columns to result in character vector is terribly inefficient. – Arun Feb 08 '15 at 19:19
  • @alexis_laz Great solution! Thanks! Your link answers my question completely, it is good to know exactly why apply does this. – Keo Feb 09 '15 at 18:57
  • @Arun The original intent is to generate a system (OS) command that can be passed via `system()`, using the values in each row as parameters. – Keo Feb 09 '15 at 19:02
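
A minimal sketch of the `do.call(paste, ...)` approach suggested in the comments above, run on the sample data frame from the question. It hands each column to `paste` as a separate vector, so the numeric column is converted with `as.character` instead of going through `as.matrix`, and no row-wise loop is needed. The variable name `varpaste` is only illustrative.

```r
# Sample data from the question
df <- data.frame(V1 = rep(letters[1:20], 1000/20),
                 V2 = (1:1000) + 0.00000001,
                 V3 = rep(letters[1:20], 1000/20),
                 stringsAsFactors = FALSE)

# c(df, sep = "_") is a list holding the columns plus the sep argument,
# so this call is equivalent to paste(df$V1, df$V2, df$V3, sep = "_")
varpaste <- do.call(paste, c(df, sep = "_"))

varpaste[c(1, 10, 100, 1000)]
```

Because `V2` stays numeric in `df`, this should also avoid the extra memory needed to store the numbers as strings.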

2 Answers


I'm not sure what's causing this behavior of apply, but I'd propose an alternative since you're interested in speed. Take a look at Hadley's tidyr package and its unite function.

library(tidyr)

df <- data.frame(V1=rep(letters[1:20], 1000/20), V2=(1:1000)+0.00000001,
                 V3=rep(letters[1:20], 1000/20), stringsAsFactors=F)

unite(df, var, c(V1, V2, V3))

#              var
# 1 a_1.00000001_a
# 2 b_2.00000001_b
# 3 c_3.00000001_c
# 4 d_4.00000001_d
# 5 e_5.00000001_e
# 6 f_6.00000001_f

system.time(varapl <- unite(df, var, c(V1, V2, V3)))

# user  system elapsed 
#   0       0       0 
Drvi


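@alexis_laz answered the question (thanks!) by linking to [this](http://stackoverflow.com/questions/15618527/why-does-as-matrix-add-extra-spaces-when-converting-numeric-to-character): `apply` calls `as.matrix` on the data frame. I'm posting it here since it was only mentioned in the comments section.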
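
For reference, a small sketch of what the link in the comments describes. `apply` first converts the data frame with `as.matrix`, and for a data frame that mixes character and numeric columns that conversion runs `format` on the numeric columns. `format` keeps 7 significant digits by default and pads every value to a common width, which would explain both the rounding and the leading spaces. The three-row data frame `small` is made up for illustration, and the sketch assumes the default `options(digits = 7)`.

```r
# Hypothetical three-row example (not the original data)
small <- data.frame(V1 = c("a", "j", "t"),
                    V2 = c(1, 10, 1000) + 0.00000001,
                    stringsAsFactors = FALSE)

m <- as.matrix(small)   # the conversion apply() performs on a data frame
m[, "V2"]               # numbers as strings: rounded and padded to one width
format(small$V2)        # format() produces the same strings

as.character(small$V2)  # the conversion paste() applies to a numeric vector
```

This is also why the for loop in the question keeps the decimals: `paste` there sees the numeric value itself, not a pre-formatted matrix cell.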

Community
Keo