2

I have a column with addresses. The data is not clean and the information includes street and house number or sometimes postcode and city. I would like to move the postcode and city information to another column with R, while street and house number stay in the old place. The postcode is a 4 digit number string. I am grateful for any suggestion for a solution.

Table right now

Desired outcome

user438383
  • 5,716
  • 8
  • 28
  • 43
Elena
  • 31
  • 1
  • Edit your post to add some sample data and you will likely get faster help. Look into dput() to paste your data structure – jpsmith Jan 07 '22 at 11:42
  • Welcome to SO! Would you mind providing [a minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) including a snippet of your data or some fake data and the code you tried. Please do not post an image of code/data/errors [for these reasons](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-errors-when-asking-a-question/285557#285557). – stefan Jan 07 '22 at 11:56
  • ... this said you could try with something like `df$Ort <- ifelse(grepl("^\\d{4}", df$Halter), df$Halter, NA_character_)` and `df$Strasse <- ifelse(!grepl("^\\d{4}", df$Halter), df$Halter, NA_character_)` – stefan Jan 07 '22 at 11:57

4 Answers4

2

An ifelse with grepl should help -

library(dplyr)

df <- df %>%
  mutate(Strasse = ifelse(grepl('^\\d{4}', Halter), '', Halter), 
         Ort = ifelse(Strasse == '', Halter, ''))

#  Line          Halter     Strasse             Ort
#1    1        1007 Abc                    1007 Abc
#2    2 1012 Long words             1012 Long words
#3    3     Enelbach 54 Enelbach 54                
#4    4         Abcd 56     Abcd 56                
#5    5      Engasse 21  Engasse 21                

grepl('^\\d{4}', Halter) returns TRUE if it finds a 4-digit number at the start of the string else returns FALSE.

data

It is easier to help if you provide data in a reproducible format

df <- data.frame(Line = 1:5, 
                 Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54', 
                            'Abcd 56', 'Engasse 21'))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
2

In addition to the neat solution of @Ronak Shah, if you want to use base R

df <- data.frame(Line = 1:5, 
                 Halter = c('1007 Abc', '1012 Long words', 'Enelbach 54', 
                            'Abcd 56', 'Engasse 21'))
df$Strasse <- with(df, ifelse(grepl('^\\d{4}', Halter), '', Halter))
df$Ort <- with(df, ifelse(Strasse == '', Halter, ''))
> head(df)
  Line          Halter     Strasse             Ort
1    1        1007 Abc                    1007 Abc
2    2 1012 Long words             1012 Long words
3    3     Enelbach 54 Enelbach 54                
4    4         Abcd 56     Abcd 56                
5    5      Engasse 21  Engasse 21                
Dion Groothof
  • 1,406
  • 5
  • 15
2

An option is also with separate

library(dplyr)
library(tidyr)
df %>% 
  separate(Halter, into = c("Strasse", "Ort"), sep = "(?<=[0-9])$|^(?=[0-9]{4} )")
  Line     Strasse             Ort
1    1                    1007 Abc
2    2             1012 Long words
3    3 Enelbach 54                
4    4     Abcd 56                
5    5  Engasse 21                

data

df <- structure(list(Line = 1:5, Halter = c("1007 Abc", "1012 Long words", 
"Enelbach 54", "Abcd 56", "Engasse 21")), class = "data.frame", row.names = c(NA, 
-5L))
akrun
  • 874,273
  • 37
  • 540
  • 662
2

Suisse postal codes are made up of 4 digits:

library(dplyr)
library(stringr)
df %>% 
  mutate(Strasse = str_extract(Halter, '\\d{4}\\s.+'))
  Line          Halter         Strasse
1    1        1007 Abc        1007 Abc
2    2 1012 Long words 1012 Long words
3    3     Enelbach 54            <NA>
4    4         Abcd 56            <NA>
5    5      Engasse 21            <NA>
TarJae
  • 72,363
  • 6
  • 19
  • 66