I have a data frame (all_postcodes) of more than 10 million records. [Edit] Here are just a few of them:
pcode   area   east  north     area2     area3     area4     area5
AB101AA   10 394251 806376 S92000003 S08000006 S12000033 S13002483
AB101AB   10 394232 806470 S92000003 S08000006 S12000033 S13002483
AB101AF   10 394181 806429 S92000003 S08000006 S12000033 S13002483
AB101AG   10 394251 806376 S92000003 S08000006 S12000033 S13002483
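For reference, here is a minimal reproducible version of those rows (the column types are my guess from the printout):

all_postcodes <- data.frame(
  pcode = c("AB101AA", "AB101AB", "AB101AF", "AB101AG"),
  area  = 10,
  east  = c(394251, 394232, 394181, 394251),
  north = c(806376, 806470, 806429, 806376),
  area2 = "S92000003",
  area3 = "S08000006",
  area4 = "S12000033",
  area5 = "S13002483",
  stringsAsFactors = FALSE
)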
I want to create a new column containing normalised versions of the pcode column, using the following function:
pcode_normalize <- function(x) {
  # collapse double spaces into a single space
  x <- gsub("  ", " ", x)
  # if there is no space at all, insert one after the fourth character
  if (length(which(strsplit(x, "")[[1]] == " ")) == 0) {
    x <- paste(substr(x, 1, 4), substr(x, 5, 7))
  }
  x
}
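On a single value the function inserts the missing space and leaves already-spaced codes alone, e.g.:

pcode_normalize("AB101AA")
#> [1] "AB10 1AA"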
I tried to execute it as follows:
all_postcodes$npcode <- sapply(all_postcodes$pcode, pcode_normalize)
but it takes too long. Any suggestions on how to improve the performance?
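One direction I have considered is dropping sapply entirely, since gsub, grepl, substr and paste are all already vectorised. A rough sketch (pcode_normalize_vec is my own name for it, and I have not benchmarked it at the full 10 million rows):

pcode_normalize_vec <- function(x) {
  # collapse double spaces across the whole vector at once
  x <- gsub("  ", " ", x, fixed = TRUE)
  # flag the codes that have no space at all
  no_space <- !grepl(" ", x, fixed = TRUE)
  # insert the space after the fourth character, only for flagged codes
  x[no_space] <- paste(substr(x[no_space], 1, 4),
                       substr(x[no_space], 5, 7))
  x
}

all_postcodes$npcode <- pcode_normalize_vec(all_postcodes$pcode)

Is this the right direction, or is there a faster idiom for this kind of string normalisation?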