
I have put several lapply statements (which extract postal codes from 5 large text fields) in a function:

opm_naar_postc=function(kolom1,kolom2,kolom3,kolom4,kolom5) {
    postc=lapply(kolom1, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc1=lapply(kolom1, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc2=lapply(kolom2, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc3=lapply(kolom2, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc4=lapply(kolom3, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc5=lapply(kolom3, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc6=lapply(kolom4, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc7=lapply(kolom4, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc8=lapply(kolom5, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][' '][a-zA-Z][a-zA-Z](\\D))", x)))[1])
    postc9=lapply(kolom5, function(x) unlist(regmatches(x,gregexpr("((\\D)[1-4][0-9][0-9][0-9][a-zA-Z][a-zA-Z](\\D))", x)))[1])

Then I want to remove any spaces, dots, NAs, etc. from postc through postc9:

postcodes=c("postc","postc1","postc2","postc3","postc4","postc5","postc6","postc7","postc8","postc9")
for (i in postcodes) {
  i=gsub(" ","",i)
  i=gsub("NA|[[:punct:]]","",i)  }

Finally, I paste postc through postc9 together so that one variable is left. This variable is my return value. I call the function like this:

df = df %>% mutate(postcode=opm_naar_postc(var1,var2,var3,var4,var5)) 

First of all, the for loop doesn't work (there is no error, but it doesn't change anything). The cleaning does work when I don't use a for loop. Second, I want to put all 10 lapply calls in one for loop. Is that possible? I've tried a lot of things, but nothing seems to work...

Who can help me?

Thanks!

An example of my dataframe df:

var1            var2          var3            var4           var5
blablaehdhde    blablatext    blabla 1983 rf  blablatext     blablatext
1982 rf blabla  text blala    blablbal        blaakakk text  hahahahah
blblatext       textte8743GH  sdkhflksfjf     kjsnhblabla    gagagagag

Expected outcome:

postcode
1983rf
1982rf
8743GH
W. Mooi
  • What is your expected output? – Sotos Oct 10 '17 at 11:22
  • One variable "postcode" in dataframe df, holding the postal code strings without the blanks, NAs, etc. – W. Mooi Oct 10 '17 at 11:29
  • Can you give a small [reproducible example](http://stackoverflow.com/questions/5963269) of your data frame please? – Sotos Oct 10 '17 at 11:30
  • The loop doesn't work because you're not altering your vector `postcodes` but the loop variable `i`, which is not returned. Making `i` an integer counter and replacing `i` in the loop by `postcodes[i]` seems to do what you want: `postcodes=c("postc","post c1","postc2","postc3","post c4","postc5","postc6","postc7","postc8","postc9"); for (i in 1:length(postcodes)) { postcodes[i]=gsub(" ","",postcodes[i]); postcodes[i]=gsub("NA|[[:punct:]]","",postcodes[i]) }` – xraynaud Oct 10 '17 at 11:40
  • I've tried this, but nothing changes. If I for example return the variable "postc" (the first element of postcodes), then the spaces are not replaced... – W. Mooi Oct 10 '17 at 11:49
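
As the comments point out, `for (i in postcodes)` walks over the name strings `"postc"`, `"postc1"`, ... so `gsub()` only rewrites those strings and the extracted vectors themselves are never touched. A minimal sketch of one alternative, assuming the extracted values are kept together in a list (the sample values and the `clean` helper are made up for illustration):

```r
# Hypothetical stand-ins for two of the extracted vectors
postc  <- c(" 1982 rf.", NA, NA)
postc1 <- c(NA, NA, "8743GH ")

# Looping over the names only edits the name strings, not the data
for (i in c("postc", "postc1")) {
  i <- gsub(" ", "", i)  # i is the text "postc" or "postc1"
}

# Illustrative helper: blank out real NA values, then drop
# everything that is not a digit or a letter
clean <- function(x) {
  x[is.na(x)] <- ""
  gsub("[^0-9A-Za-z]", "", x)
}

# One lapply() cleans every vector held in the list
cleaned <- lapply(list(postc = postc, postc1 = postc1), clean)

# Paste the cleaned vectors element-wise into a single result
postcode <- do.call(paste0, unname(cleaned))
```

Replacing real `NA` values before pasting avoids having to strip the text "NA" afterwards, which could also delete those two letters from a genuine match.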

2 Answers


Here is an idea using regex:

gsub('^\\D*?(\\d+)\\s?(\\D{2}).*$', '\\1\\2', grep('\\d+', unlist(df), value = TRUE))

#   var12    var23    var31 
#"1982rf" "8743GH" "1983rf" 
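
To get one postcode per row, so the result fits in `mutate()`, the same regex idea can be applied row-wise. The sketch below is not the answerer's code: `extract_postcode` is a hypothetical helper, and the pattern is simplified to four digits, an optional space, and two letters (the question's `[1-4]` restriction on the first digit is dropped because the expected output includes `8743GH`).

```r
# Sample data copied from the question
df <- data.frame(
  var1 = c("blablaehdhde", "1982 rf blabla", "blblatext"),
  var2 = c("blablatext", "text blala", "textte8743GH"),
  var3 = c("blabla 1983 rf", "blablbal", "sdkhflksfjf"),
  var4 = c("blablatext", "blaakakk text", "kjsnhblabla"),
  var5 = c("blablatext", "hahahahah", "gagagagag"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first postcode-like match per row, or NA
extract_postcode <- function(...) {
  # Join the columns row-wise; the separator keeps a match from
  # spanning two different columns
  txt <- paste(..., sep = " | ")
  pos <- regexpr("[0-9]{4} ?[A-Za-z]{2}", txt)
  out <- rep(NA_character_, length(txt))
  # regmatches() returns matches only for rows where pos > 0
  out[pos > 0] <- regmatches(txt, pos)
  gsub(" ", "", out)  # remove the optional space
}

df$postcode <- extract_postcode(df$var1, df$var2, df$var3, df$var4, df$var5)
```

Because the helper is vectorised over rows, it should also work inside a pipeline as `df %>% mutate(postcode = extract_postcode(var1, var2, var3, var4, var5))` once dplyr is loaded.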
Sotos

You can try:

# your data
df <- structure(c("blablaehdhde", "1982 rf blabla", "blblatext", "blablatext", 
"text blala", "textte8743GH", "blabla 1983 rf", "blablbal", "sdkhflksfjf", 
"blablatext", "blaakakk text", "kjsnhblabla", "blablatext", "hahahahah", 
"gagagagag"), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("var1", 
"var2", "var3", "var4", "var5")))


# pipeline
library(tidyverse)
library(stringi)
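as_tibble(df) %>%                       # as.tibble() is deprecated
          gather() %>%                  # long format: one row per cell
          mutate(value = gsub(" ", "", value)) %>%
          # first run of digits plus the two characters that follow
          mutate(postcode = stri_extract_first_regex(value, "[0-9]+.{2}")) %>%
          filter(!is.na(postcode))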
# A tibble: 3 x 3
    key        value postcode
  <chr>        <chr>    <chr>
1  var1 1982rfblabla   1982rf
2  var2 textte8743GH   8743GH
3  var3 blabla1983rf   1983rf
Roman