base R alternative
Here's a somewhat faster and more direct approach using sapply (and no for loops), relying on the fact that grepl is vectorized on x=. (It is not vectorized on pattern=, requiring that to be length 1, which is one reason why we need the sapply at all.)
matches <- sapply(input.strings, grepl, x = polic$policy_label)
matches
#       seed fertilizer fertiliser  loan interest  feed insurance
# [1,]  TRUE      FALSE      FALSE FALSE    FALSE FALSE     FALSE
# [2,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE
# [3,] FALSE       TRUE      FALSE FALSE    FALSE FALSE     FALSE
# [4,] FALSE      FALSE      FALSE  TRUE    FALSE FALSE     FALSE
# [5,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE
# [6,] FALSE      FALSE      FALSE FALSE    FALSE  TRUE     FALSE
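To make the vectorization point concrete, here is a minimal sketch. The labels vector is a subset of the policy_label values visible in the output above, and the names one_pattern, patterns, and m are mine for illustration only.

```r
# a few labels taken from the output shown above
labels <- c("seed supply", "energy subsidy", "feed purchase")

# one pattern, many strings: grepl tests every element of x
one_pattern <- grepl("seed", labels)
one_pattern
# [1]  TRUE FALSE FALSE

# with several patterns, grepl would use only the first one (and warn),
# so sapply calls grepl once per pattern and binds the results by column
patterns <- c("seed", "feed")
m <- sapply(patterns, grepl, x = labels)
dim(m)
# [1] 3 2
```

Each column of m is named after the pattern that produced it, which is what makes the later lookup by colnames possible.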
Because we want to assign "others" to everything without a match (and because we will need at least one TRUE in each row), we add an others column that is TRUE wherever none of the input strings matched:
matches <- cbind(matches, others = rowSums(matches) == 0)
matches
#       seed fertilizer fertiliser  loan interest  feed insurance others
# [1,]  TRUE      FALSE      FALSE FALSE    FALSE FALSE     FALSE  FALSE
# [2,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE   TRUE
# [3,] FALSE       TRUE      FALSE FALSE    FALSE FALSE     FALSE  FALSE
# [4,] FALSE      FALSE      FALSE  TRUE    FALSE FALSE     FALSE  FALSE
# [5,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE   TRUE
# [6,] FALSE      FALSE      FALSE FALSE    FALSE  TRUE     FALSE  FALSE
From here, we can find the column names associated with the TRUE values in each row and assign them (optionally comma-collapsed) into polic:
polic$policy_class <- apply(matches, 1, function(z) toString(colnames(matches)[z]))
polic
#              policy_label policy_class
# 1             seed supply         seed
# 2          energy subsidy       others
# 3 fertilizer distribution   fertilizer
# 4          loan guarantee         loan
# 5         Interest waiver       others
# 6           feed purchase         feed
FYI, the reason I used toString is that I did not want to assume there would never be more than one match; that is, if two input.strings matched one policy_label for whatever reason, then toString will combine them into one string, e.g., "seed, feed" for multi-match policies.
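A small sketch of that multi-match case, using made-up data: the label "seed and feed supply" is hypothetical and not part of the original polic, and the names labels, patterns, m, and classes are only for illustration.

```r
# hypothetical labels; the first one matches two patterns
labels   <- c("seed and feed supply", "loan guarantee")
patterns <- c("seed", "feed", "loan")

# logical matrix: one row per label, one column per pattern
m <- sapply(patterns, grepl, x = labels)

# for each row, keep the column names that are TRUE and
# collapse them into a single comma-separated string
classes <- apply(m, 1, function(z) toString(colnames(m)[z]))
classes
# [1] "seed, feed" "loan"
```

If you would rather keep only the first match, replacing toString(...) with colnames(m)[z][1] inside the function is one option.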
fuzzyjoin alternative
If you're familiar with merges/joins (see "What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?"), then this should seem familiar. If not, the concept of joining data in this way can be transformative for data munging and cleaning.
library(fuzzyjoin)
out <- regex_left_join(
polic, data.frame(policy_class = input.strings),
by = c("policy_label" = "policy_class"))
out
#              policy_label policy_class
# 1             seed supply         seed
# 2          energy subsidy         <NA>
# 3 fertilizer distribution   fertilizer
# 4          loan guarantee         loan
# 5         Interest waiver         <NA>
# 6           feed purchase         feed
### clean up the NAs for "others"
out$policy_class[is.na(out$policy_class)] <- "others"
In contrast to the base-R variant above, there is no safeguard here (yet!) for when multiple input.strings match one policy_label; when that happens, the matching row is duplicated, so you'd see (e.g.) seed supply and all other columns on that row twice. This can be mitigated with a little extra effort.
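One possible mitigation, sketched in base R: collapse the duplicated rows back to one row per policy_label with aggregate and toString. The out data frame below is hand-built to mimic a join result with one double match (hypothetical data; fuzzyjoin itself is not called here), and collapsed is a name chosen for illustration.

```r
# hypothetical join result: the first label matched two patterns,
# so it appears on two rows
out <- data.frame(
  policy_label = c("seed and feed supply", "seed and feed supply", "loan guarantee"),
  policy_class = c("seed", "feed", "loan"),
  stringsAsFactors = FALSE
)

# one row per policy_label, with its classes comma-collapsed
collapsed <- aggregate(policy_class ~ policy_label, data = out, FUN = toString)
collapsed$policy_class
# [1] "loan"       "seed, feed"
```

Two caveats with this sketch: aggregate returns the rows sorted by policy_label rather than in their original order, and its formula interface drops rows with NA by default, so it should run after the NAs have been replaced with "others".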