
I have a question on removing leading and trailing blanks in a data.frame or data.table.

I have working solutions, but I'm trying to speed up my code.

Here is some sample data:

number_strings <- paste("  ", seq(from = 1, to = 100000, by = 1), "  ", sep = "")

data <- as.data.frame(matrix(number_strings, nrow = length(number_strings), ncol = 10),
                      stringsAsFactors = FALSE)
colnames(data) <- paste("Col", seq(from = 1, to = ncol(data), by = 1), sep = "")

Here are some columns I would like to trim:

odd_columns <- paste("Col", seq(from = 1, to = ncol(data), by = 2), sep = "")

Here are the three options I have so far:

require(gdata)  # for trim(); assuming gdata's implementation

f_trim_for <- function(x, cols){
  for(i in seq_along(cols))
  {
    x[, cols[i]] <- trim(x[, cols[i]])
  }
  return(x)
}
system.time(data1 <- f_trim_for(data, odd_columns))

f_gsub_for <- function(x, cols){
  for(i in seq_along(cols))
  {
    x[, cols[i]] <- gsub("^\\s+|\\s+$", "", x[, cols[i]], perl = TRUE)
  }
  return(x)
}
system.time(data2 <- f_gsub_for(data, odd_columns))

f_trim_dt <- function(x, cols){
  data.table(x)[, (cols) := trim(.SD), .SDcols = cols]
}
system.time(data3 <- f_trim_dt(data, odd_columns))

Here are the times:

              user  system elapsed 
f_trim_for    1.50    0.08    1.92 
f_gsub_for    0.75    0.00    0.74 
f_trim_dt     0.81    0.00    1.17 

My question: are there other approaches I'm not thinking of that could be faster?

The reason I ask is that my actual data has 1.5 million rows and 110 columns, so speed is a major issue.

I tried some other options, but they aren't working:

f_gsub_dt <- function(x, cols){
  # not working: gsub() coerces .SD (a list of columns) to a character
  # vector of deparsed columns, so the individual values are never trimmed
  data.table(x)[, (cols) := gsub("^\\s+|\\s+$", "", .SD, perl = TRUE), .SDcols = cols]
}

f_set_dt <- function(x, cols){
  for (j in cols)
  {
    # not working: set()'s second argument must be row numbers (or NULL
    # for all rows), and gsub() is applied to the column name j instead
    # of the column values x[[j]]
    set(x, x[[j]], j, gsub("^\\s+|\\s+$", "", j, perl = TRUE))
  }
  return(x)
}
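
For what it's worth, here is a sketch of how those two attempts might be repaired (the names f_gsub_dt2 and f_set_dt2 are mine, and I haven't timed them): run gsub() per column via lapply(), and give set() a NULL row index while trimming the column values x[[j]]:

library(data.table)

f_gsub_dt2 <- function(x, cols){
  # lapply() applies gsub() to each column's values in turn
  data.table(x)[, (cols) := lapply(.SD, function(v) gsub("^\\s+|\\s+$", "", v, perl = TRUE)),
                .SDcols = cols]
}

f_set_dt2 <- function(x, cols){
  x <- data.table(x)  # set() modifies by reference, so work on a data.table copy
  for (j in cols)
  {
    # i = NULL means all rows; trim the values x[[j]], not the name j
    set(x, NULL, j, gsub("^\\s+|\\s+$", "", x[[j]], perl = TRUE))
  }
  return(x)
}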
Brad
  • is this useful to you? http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-whitespace-in-r – rawr Jan 11 '14 at 18:22
  • I've seen that post and used the gsub method from it. I was trying to do something similar in data.tables. – Brad Jan 11 '14 at 18:31
  • Look at the answers to this recent question: http://stackoverflow.com/q/21056297/1412059 You should use `set`. – Roland Jan 11 '14 at 18:31
  • Roland, at the end of my question I have a `set` example but couldn't get it to work. Could you tell me what was wrong? – Brad Jan 11 '14 at 18:41
  • could you trim it with the `strip.white` argument when you read the data in? (see the sketch after these comments) – Tyler Rinker Jan 11 '14 at 20:21
  • @Brad, seems like Roland's tip led you to the answer you're looking for? Why not post it as an answer yourself? – Arun Jan 11 '14 at 21:03
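
Regarding the `strip.white` suggestion above, a minimal sketch of trimming at read time (assuming the data comes from a delimited file; "data.csv" is a made-up file name):

# strip.white = TRUE trims leading/trailing blanks from unquoted
# character fields as the file is read, avoiding a separate pass
data <- read.table("data.csv", sep = ",", header = TRUE,
                   strip.white = TRUE, stringsAsFactors = FALSE)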

2 Answers


Use `colwise` from plyr and `str_trim` from stringr.

require(plyr)
require(stringr)
data[, odd_columns] <- colwise(str_trim)(data[, odd_columns]) 
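
To compare with the timings in the question, the same system.time() harness can be used (data4 is just a fresh copy; results will vary by machine):

system.time({
  data4 <- data
  data4[, odd_columns] <- colwise(str_trim)(data4[, odd_columns])
})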
Jared
  • And is this fast? (main thing OP is looking for) – eddi Jan 13 '14 at 14:13
  • Didn't time it but trim should be much faster than a regex search. And avoiding multiple subsets of data.frames with [ ] should speed things up. – Jared Jan 13 '14 at 20:01

With dplyr and the base function `trimws`, you can trim all character columns at once:

mutate_if(data, is.character, funs(trimws(.)))
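
Note that `funs()` has since been deprecated; with dplyr 1.0 or later the same idea is written with `across()` (a sketch assuming a recent dplyr version):

library(dplyr)
# trim whitespace in every character column
data <- mutate(data, across(where(is.character), trimws))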
visu-l