
I have a data frame with more than 10 million records (all_postcodes). [Edit] Here are a few of them:

pcode    area  east    north   area2      area3      area4      area5
AB101AA  10    394251  806376  S92000003  S08000006  S12000033  S13002483
AB101AB  10    394232  806470  S92000003  S08000006  S12000033  S13002483
AB101AF  10    394181  806429  S92000003  S08000006  S12000033  S13002483
AB101AG  10    394251  806376  S92000003  S08000006  S12000033  S13002483

I want to create a new column containing a normalised version of the pcode column, using the following function:

pcode_normalize <- function (x) {
  x <- gsub("  ", " ", x)
  if (length(which(strsplit(x, "")[[1]]==" ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}

I tried to execute it as follows:

all_postcodes$npcode <- sapply(all_postcodes$pcode, pcode_normalize)

but it takes too long. Any suggestions on how to improve the performance?

Nick
  • Can you please `dput` a few rows of `all_postcodes$pcode`? [How to create a **minimal, reproducible example**](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) – Henrik Mar 24 '14 at 13:22
  • Sorry - just did that! Thanks for the suggestion! – Nick Mar 24 '14 at 13:54
  • You may want to use `(g)sub("[ ]{2,}",' ', x)` instead, which is more general – Janhoo Mar 24 '14 at 14:57

1 Answer


All the functions you used in pcode_normalize are already vectorized, so there's no need to loop over the elements with sapply. It also looks like you're using strsplit only to check whether the string contains a space; grepl does that directly and would be faster.
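As a small illustration of what "vectorized" means here, the sketch below calls the same base R functions on a whole character vector at once. The vector pcodes is a made-up sample modelled on the postcodes in the question, not real data from all_postcodes.

```r
# Hypothetical sample: one postcode without a space, one with a single
# space, and one with a double space
pcodes <- c("AB101AA", "AB10 1AB", "AB10  1AF")

# Each call handles every element of the vector in one go
collapsed  <- gsub("  ", " ", pcodes, fixed = TRUE)  # collapse double spaces
has_space  <- grepl(" ", collapsed, fixed = TRUE)    # one logical per element
first_part <- substr(collapsed, 1, 4)                # first four characters of each

print(collapsed)
print(has_space)
print(first_part)
```

Because has_space is a logical vector, it can be used to subset and modify only the elements that still need a space inserted.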

Using fixed=TRUE in your calls to gsub and grepl will be faster, since you're not actually using regular expressions.

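pcode_normalize <- function (x) {
  # Collapse double spaces into single spaces (literal match, no regex)
  x <- gsub("  ", " ", x, fixed=TRUE)
  # TRUE for every element that already contains a space
  sp <- grepl(" ", x, fixed=TRUE)
  # Insert a space after the fourth character where none is present
  x[!sp] <- paste(substr(x[!sp], 1, 4), substr(x[!sp], 5, 7))
  x
}
all_postcodes$npcode <- pcode_normalize(all_postcodes$pcode)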

I couldn't actually test this, since you didn't provide any example data, but it should get you on the right path.
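One way to check the rewrite is to run both versions on a small sample and compare the results. The sketch below is only that: the function names pcode_normalize_loop and pcode_normalize_vec are renamed copies introduced here for the comparison, and the sample vector combines the four postcodes from the question with two made-up variants that contain spaces.

```r
# Per-element version from the question (renamed for the comparison)
pcode_normalize_loop <- function(x) {
  x <- gsub("  ", " ", x)
  if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}

# Vectorized version from this answer (renamed for the comparison)
pcode_normalize_vec <- function(x) {
  x <- gsub("  ", " ", x, fixed = TRUE)
  sp <- grepl(" ", x, fixed = TRUE)
  x[!sp] <- paste(substr(x[!sp], 1, 4), substr(x[!sp], 5, 7))
  x
}

# Four postcodes from the question plus two hypothetical spaced variants
pcodes <- c("AB101AA", "AB101AB", "AB101AF", "AB101AG",
            "AB10 1AA", "AB10  1AB")

res_vec  <- pcode_normalize_vec(pcodes)
res_loop <- sapply(pcodes, pcode_normalize_loop, USE.NAMES = FALSE)

# Both approaches should agree on every element
stopifnot(identical(res_vec, res_loop))
print(res_vec)

# Rough timing on a larger repeated sample; exact numbers depend on the machine
big <- rep(pcodes, 5000)
print(system.time(pcode_normalize_vec(big)))
print(system.time(sapply(big, pcode_normalize_loop, USE.NAMES = FALSE)))
```

If pcode is stored as a factor rather than a character vector, gsub coerces it to character, so the new column comes back as plain character either way.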

Joshua Ulrich
  • You are right - it works and is lightning fast! The biggest improvement is removing sapply (previously I had to stop R as it was taking more than an hour) but your version of the function is also much faster. Now without sapply and with your code it takes less than a second. Thanks a lot! – Nick Mar 24 '14 at 17:02