If you are going to add a loop, it makes no sense to use case_when(); you don't have to add all options into it if you can loop over them.
You can solve it with a for-loop:
library(stringi)
df2 <- df
for(c in city) df2$city[stri_detect_fixed(df2$address, c)] <- c
for(d in district) df2$district[stri_detect_fixed(df2$address, d)] <- d
for(s in streets) df2$street[stri_detect_fixed(df2$address, s)] <- s
Note that your example code didn't work; the district names are 'a' and 'b' in your example dataset, but you generate names 'j' through 't'. I fixed that in my code above.
This approach also gives wrong results (not an error, just silent mismatches) if the names of cities, districts and/or streets overlap. For instance, if a row is in district 'b' and in street 'cc', stri_detect_fixed will also find the 'c' inside 'cc' and assign district 'c'. To overcome this, I propose a completely different method.
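To make the overlap problem concrete, here is a minimal sketch. It uses base R's grepl(..., fixed = TRUE) as a stand-in for stri_detect_fixed so that it runs without extra packages; the address and the shortened district list are taken from the example data further down.

```r
# Substring matching: every district whose name occurs anywhere in the
# address overwrites the previous match.
address   <- "A, b, cc,"
districts <- c("a", "b", "c", "d")

district <- NA_character_
for (d in districts) {
  if (grepl(d, address, fixed = TRUE)) district <- d
}

district  # "c", although the address actually names district "b"
```

The 'c' is found inside the street name 'cc', and because it comes after 'b' in the loop, it wins.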
Alternative method
Given your example data, it makes most sense to first split the given address at each comma (,), then look for exact matches with your reference city/district/street names. We can look for those exact matches with intersect().
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# vectorize address into elements
address_elems <- strsplit(df$address, ',') # split by comma
address_elems <- lapply(address_elems, trimws) # trim whitespace; lapply() always returns a list
Compare df$address and the newly created address_elems:
> df$address
[1] "A, b, cc," "B, dd" "a, dd" "C" "D, a, cc"
> address_elems
[[1]]
[1] "A" "b" "cc"
[[2]]
[1] "B" "dd"
[[3]]
[1] "a" "dd"
[[4]]
[1] "C"
[[5]]
[1] "D" "a" "cc"
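One caveat about the trimming step, sketched with two made-up addresses that are not part of the example data: sapply() tries to simplify its result, so when every address happens to have the same number of parts it returns a matrix instead of a list, whereas lapply() always returns a list.

```r
# Two hypothetical addresses with the same number of parts
same_len <- strsplit(c("A, b", "B, c"), ",", fixed = TRUE)

as_matrix <- sapply(same_len, trimws)  # simplified to a 2 x 2 character matrix
as_list   <- lapply(same_len, trimws)  # stays a list with one vector per address

is.matrix(as_matrix)  # TRUE
is.list(as_list)      # TRUE
```

With the example data above the lengths differ, so both functions give the list shown; lapply() is just the safer choice for arbitrary input.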
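We can find the matching cities for just the first vector in address_elems with intersect(cities, address_elems[[1]]).
Because we might get multiple matches, we only take the first element, with intersect(cities, address_elems[[1]])[1]. Use single brackets here: [1] returns NA when there is no match, whereas [[1]] would throw a "subscript out of bounds" error.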
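A small sketch of that behaviour, using a shortened city list and two address vectors from the example data:

```r
cities <- c("A", "B", "C", "D")

# One exact match: "A" is a city; "b" and "cc" are not
hit  <- intersect(cities, c("A", "b", "cc"))[1]

# No match at all: intersect() returns character(0), and [1] turns that into NA
miss <- intersect(cities, c("a", "dd"))[1]

hit   # "A"
miss  # NA
```

Matching is case-sensitive and on whole elements, so neither the lowercase "a" nor the "c" inside "cc" counts as a city.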
To apply this to every vector in address_elems, we can use sapply() or lapply():
# intersect the respective reference lists with each list of
# address items, taking only the first element
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
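If you prefer a stricter variant, the same lookup can be sketched with vapply(), which guarantees that each address yields exactly one character value. The helper name first_match is made up for this illustration; the reference vectors and addresses are the ones from the example.

```r
# Hypothetical helper: for each address, the first reference value found
# among its parts, or NA when none match.
first_match <- function(reference, parts) {
  vapply(parts, function(x) intersect(reference, x)[1], character(1))
}

cities    <- LETTERS            # "A" to "Z"
districts <- letters[1:11]      # "a" to "k"
streets   <- c("aa", "bb", "cc", "dd")
address   <- c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc")

# Split at commas and trim whitespace, keeping one vector per address
parts <- lapply(strsplit(address, ",", fixed = TRUE), trimws)

first_match(cities, parts)     # "A"  "B"  NA   "C"  "D"
first_match(districts, parts)  # "b"  NA   "a"  NA   "a"
first_match(streets, parts)    # "cc" "dd" "dd" NA   "cc"
```

Unlike sapply(), vapply() stops with an error if a result is not a single string, so a surprise in the input does not silently change the column type.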
Putting it all together
The complete script:
# example reference address parts
cities <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L",
"M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y",
"Z")
districts <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k")
streets <- c("aa", "bb", "cc", "dd")
# example dataset
df <- data.frame(x = c(1:5),
address = c("A, b, cc,", "B, dd", "a, dd", "C", "D, a, cc"))
# create list of address elements
address_elems <- strsplit(df$address, ',') # split by comma
address_elems <- lapply(address_elems, trimws) # trim whitespace; lapply() always returns a list
# intersect the respective reference lists with each list of
# address items, take only the first element (NA if there is no match)
df$cities = sapply(address_elems, function(x) intersect(cities, x)[1])
df$district = sapply(address_elems, function(x) intersect(districts, x)[1])
df$street = sapply(address_elems, function(x) intersect(streets, x)[1])
# cleanup
rm(address_elems)