
I have a question on removing leading and trailing blanks in a data.frame or data.table.

I have working solutions, but I'm trying to speed up my code.

Here is some sample data:

number_strings <- paste("  ", seq(from = 1, to = 100000, by = 1), "  ", sep = "")

data <- as.data.frame(matrix(number_strings, nrow = length(number_strings), ncol = 10),
                      stringsAsFactors = FALSE)
colnames(data) <- paste("Col", seq(from = 1, to = ncol(data), by = 1), sep = "")

Here are some columns I would like to trim:

odd_columns <- paste("Col", seq(from = 1, to = ncol(data), by = 2), sep = "")

Here are the three options I have so far:

require(gdata)  # for trim(); assuming gdata's implementation

f_trim_for <- function(x, cols){
  for(i in seq_along(cols))
  {
    x[, cols[i]] <- trim(x[, cols[i]])
  }
  return(x)
}
system.time(data1 <- f_trim_for(data, odd_columns))

f_gsub_for <- function(x, cols){
  for(i in seq_along(cols))
  {
    x[, cols[i]] <- gsub("^\\s+|\\s+$", "", x[, cols[i]], perl = TRUE)
  }
  return(x)
}
system.time(data2 <- f_gsub_for(data, odd_columns))

f_trim_dt <- function(x, cols){
  data.table(x)[, (cols) := trim(.SD), .SDcols = cols]
}
system.time(data3 <- f_trim_dt(data, odd_columns))

Here are the times:

              user  system elapsed 
f_trim_for    1.50    0.08    1.92 
f_gsub_for    0.75    0.00    0.74 
f_trim_dt     0.81    0.00    1.17 

My question: are there other approaches I'm not thinking of that could be faster?

The reason I ask is that my actual data has 1.5 million rows and 110 columns, so speed is a major issue.

I tried some other options, but they aren't working:

f_gsub_dt <- function(x, cols){
  # not working: gsub() coerces .SD (a list of columns) to a character
  # vector of deparsed columns, so the individual values are never trimmed
  data.table(x)[, (cols) := gsub("^\\s+|\\s+$", "", .SD, perl = TRUE), .SDcols = cols]
}

f_set_dt <- function(x, cols){
  for (j in cols)
  {
    # not working: set()'s second argument must be row numbers (or NULL
    # for all rows), and gsub() is applied to the column name j instead
    # of the column values x[[j]]
    set(x, x[[j]], j, gsub("^\\s+|\\s+$", "", j, perl = TRUE))
  }
  return(x)
}
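
For what it's worth, here is a sketch of how those two attempts might be repaired (the names f_gsub_dt2 and f_set_dt2 are mine, and I haven't timed them): run gsub() per column via lapply(), and give set() a NULL row index while trimming the column values x[[j]]:

library(data.table)

f_gsub_dt2 <- function(x, cols){
  # lapply() applies gsub() to each column's values in turn
  data.table(x)[, (cols) := lapply(.SD, function(v) gsub("^\\s+|\\s+$", "", v, perl = TRUE)),
                .SDcols = cols]
}

f_set_dt2 <- function(x, cols){
  x <- data.table(x)  # set() modifies by reference, so work on a data.table copy
  for (j in cols)
  {
    # i = NULL means all rows; trim the values x[[j]], not the name j
    set(x, NULL, j, gsub("^\\s+|\\s+$", "", x[[j]], perl = TRUE))
  }
  return(x)
}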
Brad
  • is this useful to you? http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-whitespace-in-r – rawr Jan 11 '14 at 18:22
  • I've seen that post and used the gsub method from it. I was trying to do something similar in data.tables. – Brad Jan 11 '14 at 18:31
  • Look at the answers to this recent question: http://stackoverflow.com/q/21056297/1412059 You should use `set`. – Roland Jan 11 '14 at 18:31
  • Roland, at the end of my question I have a `set` example but couldn't get it to work. Could you tell me what was wrong? – Brad Jan 11 '14 at 18:41
  • could you trim it with the `strip.white` argument when you read the data in? (see the sketch after these comments) – Tyler Rinker Jan 11 '14 at 20:21
  • @Brad, seems like Roland's tip led you to the answer you're looking for? Why not post it as an answer yourself? – Arun Jan 11 '14 at 21:03
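
Regarding the `strip.white` suggestion above, a minimal sketch of trimming at read time (assuming the data comes from a delimited file; "data.csv" is a made-up file name):

# strip.white = TRUE trims leading/trailing blanks from unquoted
# character fields as the file is read, avoiding a separate pass
data <- read.table("data.csv", sep = ",", header = TRUE,
                   strip.white = TRUE, stringsAsFactors = FALSE)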

2 Answers


Use `colwise` from plyr and `str_trim` from stringr.

require(plyr)
require(stringr)
data[, odd_columns] <- colwise(str_trim)(data[, odd_columns]) 
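
To compare with the timings in the question, the same system.time() harness can be used (data4 is just a fresh copy; results will vary by machine):

system.time({
  data4 <- data
  data4[, odd_columns] <- colwise(str_trim)(data4[, odd_columns])
})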
Jared
  • And is this fast? (main thing OP is looking for) – eddi Jan 13 '14 at 14:13
  • Didn't time it but trim should be much faster than a regex search. And avoiding multiple subsets of data.frames with [ ] should speed things up. – Jared Jan 13 '14 at 20:01

With dplyr and the base function `trimws`, you can trim all character columns at once:

mutate_if(data, is.character, funs(trimws(.)))
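
Note that `funs()` has since been deprecated; with dplyr 1.0 or later the same idea is written with `across()` (a sketch assuming a recent dplyr version):

library(dplyr)
# trim whitespace in every character column
data <- mutate(data, across(where(is.character), trimws))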
visu-l