1

I wrote this lines of code to work in a dataframe that returns a new column with case insensitive match with the elements of string list.

However, the resulting column works for the first element of the list only, 'seed' in this case, but not other match. Not sure where is the wrong in the for loop.

Here is the sample dataframe you may want to check results for.

input.strings <- c('seed', 'fertilizer', 'fertiliser', 'loan', 'interest', 'feed', 'insurance')

polic = data.frame(policy_label=c('seed supply','energy subsidy','fertilizer distribution','loan guarantee','Interest waiver','feed purchase'))

polic$policy_class <- sapply(polic$policy_label, function(x){
                        for (i in input.strings){
                           if (grepl(i, tolower(x))){
                           return(i)
                      }
                      else{
                       return("others")
                      }
                     }
                    }) 
user438383
  • 5,716
  • 8
  • 28
  • 43

1 Answers1

1

base R alternative

Here's a somewhat faster and more-direct approach using sapply (and no for loops), relying on the fact that grepl can be vectorized on x=. (It is not vectorized on pattern=, requiring that to be length 1, which is one reason why we need the sapply at all.)

matches <- sapply(input.strings, grepl, x = polic$policy_label)
matches
#       seed fertilizer fertiliser  loan interest  feed insurance
# [1,]  TRUE      FALSE      FALSE FALSE    FALSE FALSE     FALSE
# [2,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE
# [3,] FALSE       TRUE      FALSE FALSE    FALSE FALSE     FALSE
# [4,] FALSE      FALSE      FALSE  TRUE    FALSE FALSE     FALSE
# [5,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE
# [6,] FALSE      FALSE      FALSE FALSE    FALSE  TRUE     FALSE

Because we want to assign "others" to everything without a match (and because we will need at least one TRUE in

matches <- cbind(matches, others = rowSums(matches) == 0)
matches
#       seed fertilizer fertiliser  loan interest  feed insurance others
# [1,]  TRUE      FALSE      FALSE FALSE    FALSE FALSE     FALSE  FALSE
# [2,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE   TRUE
# [3,] FALSE       TRUE      FALSE FALSE    FALSE FALSE     FALSE  FALSE
# [4,] FALSE      FALSE      FALSE  TRUE    FALSE FALSE     FALSE  FALSE
# [5,] FALSE      FALSE      FALSE FALSE    FALSE FALSE     FALSE   TRUE
# [6,] FALSE      FALSE      FALSE FALSE    FALSE  TRUE     FALSE  FALSE

From here, we can find the names associated with the true values and assign them (optionally ,-collapsed) into polic:

polic$policy_class <- apply(matches, 1, function(z) toString(colnames(matches)[z]))
polic
#              policy_label policy_class
# 1             seed supply         seed
# 2          energy subsidy       others
# 3 fertilizer distribution   fertilizer
# 4          loan guarantee         loan
# 5         Interest waiver       others
# 6           feed purchase         feed

FYI, the reason I used toString is because I did not want to assume that there would always be no more than one match; that is, if two input.strings matched one policy_label for whatever reason, than toString will combine them into one string, e.g., "seed, feed" for multi-match policies.

fuzzyjoin alternative

If you're familiar with merges/joins (and What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?), then this should seem familiar. If not, the concept of joining data in this way can be transformative to data-munging/cleaning.

library(fuzzyjoin)
out <- regex_left_join(
  polic, data.frame(policy_class = input.strings),
  by = c("policy_label" = "policy_class"))
out
#              policy_label policy_class
# 1             seed supply         seed
# 2          energy subsidy         <NA>
# 3 fertilizer distribution   fertilizer
# 4          loan guarantee         loan
# 5         Interest waiver         <NA>
# 6           feed purchase         feed

### clean up the NAs for "others"
out$policy_class[is.na(out$policy_class)] <- "others"

In contrast to the base-R variant above, there is no safe-guard here (yet!) to handle when multiple input.strings match one policy_label; when that happens, that row with a match will be duplicated, so you'd see (e.g.) seed supply and all other columns on that row twice. This can easily be mitigated given some effort.

r2evans
  • 141,215
  • 6
  • 77
  • 149
  • Thanks for the help. I tried with this one line of code: ```str_extract_all(polic$policy_label, paste(input.strings, collapse = "|"))``` but it is not case insensitive thus 'Interest' doesn't return interest. – Abdullah Mamun Jan 24 '22 at 20:18