
Say I have this df:

df <- data.frame(address = c('123 Harrison St', '456 Circle Dr.', '345 Round Blvd'))

I'd like to expand the street-type abbreviations to the full words. However, I'm not sure that every street type will appear in the df (it may contain addresses with 'ln' and 'ave' but not 'blvd', etc.). The final output would look like this:

'123 Harrison Street'
'456 Circle Drive'
'345 Round Boulevard'

I've tried the following, but I get a warning that argument 'replacement' has length > 1 and only the first element will be used:

abbr <- c('St'= 'Street', 'Dr' = 'Drive', 'Blvd' = 'Boulevard', 'Ln' = 'Lane')
pattern <- paste0("\\b(", paste0(abbr, collapse = "|"), ")\\b")
df$address <- gsub(pattern, abbr, df$address, ignore.case = TRUE)

My question is two-fold:

1.) Why does it throw the warning when the correct abbreviations are in the abbr variable?
2.) How can I make the code work when some abbreviations are in the abbr variable but not in the df?

TIA.

jay.sf
MKN17
  • Check [this solution](https://stackoverflow.com/a/49533299/3832970). Also, see [Dictionary style replace multiple items](https://stackoverflow.com/q/7547597/3832970). – Wiktor Stribiżew Jun 19 '23 at 18:52
  • And mind your regex looks like `\b(Street|Drive|Boulevard|Lane)\b`, i.e. it is an alternation of values, not keys. You need `pattern <- paste0("\\b(", paste0(names(abbr), collapse = "|"), ")\\b")` and then `df$address <- stringr::str_replace_all(df$address, pattern, function(m) abbr[m][[1]])` – Wiktor Stribiżew Jun 19 '23 at 19:02
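
To make the key/value point from the comment concrete without extra packages, here is a minimal base-R sketch. It builds the alternation from names(abbr) and uses gregexpr with `regmatches<-` to replace each match by the value stored under its key. The helper name expand_abbr is hypothetical, and the optional trailing period in the pattern is an assumption based on the 'Dr.' example in the question.

```r
abbr <- c('St' = 'Street', 'Dr' = 'Drive', 'Blvd' = 'Boulevard', 'Ln' = 'Lane')

# expand_abbr() is a hypothetical helper name, not part of base R
expand_abbr <- function(x, abbr) {
  # Alternation over the KEYS (the abbreviations), not the values;
  # "\\.?" also consumes an optional trailing period such as in 'Dr.'
  pattern <- paste0("\\b(", paste(names(abbr), collapse = "|"), ")\\b\\.?")
  m <- gregexpr(pattern, x, perl = TRUE)
  # Replace every match with the value stored under its key
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    unname(abbr[sub("\\.$", "", hits)])
  })
  x
}

expand_abbr(c('123 Harrison St', '456 Circle Dr.', '345 Round Blvd'), abbr)
# [1] "123 Harrison Street" "456 Circle Drive"    "345 Round Boulevard"
```

Keys that never occur in the data (here 'Ln') simply produce no match, and addresses without any abbreviation come back unchanged.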

2 Answers


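The replacement argument of gsub takes a single string; if you pass a longer vector, only its first element is used, which is what the warning is telling you. One workaround is to split each address into words and look each word up in abbr: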

df <- data.frame(address = c('123 Harrison St', '456 Circle Dr.', '345 Round Blvd'))

# Keys must match the words exactly as they appear in the data (note 'Dr.' with the period)
abbr <- c('St' = 'Street', 'Dr.' = 'Drive', 'Blvd' = 'Boulevard', 'Ln' = 'Lane')

df$address <- vapply(df$address, function(x) {
  words <- unlist(strsplit(x, " "))
  # Swap in the full word wherever a word is a key of abbr; leave the rest untouched
  hit <- words %in% names(abbr)
  words[hit] <- abbr[words[hit]]
  paste(words, collapse = " ")
}, character(1), USE.NAMES = FALSE)

print(df$address)
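
The word-by-word lookup above is case-sensitive and only matches when a word equals a key exactly (hence the 'Dr.' key with its period). Since the question used ignore.case = TRUE, a variant that normalises each word before the lookup may be closer to what was intended. The sketch below is one way to do that; expand_words is a hypothetical name, and lower-case keys plus stripping one trailing period are assumptions, not something the original answer specifies.

```r
# Lower-case keys so the lookup can ignore case (an assumption for this sketch)
abbr <- c(st = 'Street', dr = 'Drive', blvd = 'Boulevard', ln = 'Lane')

# expand_words() is a hypothetical helper name
expand_words <- function(x, abbr) {
  vapply(strsplit(x, " ", fixed = TRUE), function(words) {
    # Normalise each word: drop one trailing period, then lower-case
    key <- tolower(sub("\\.$", "", words))
    hit <- key %in% names(abbr)
    # Replace only the words whose normalised form is a known key
    words[hit] <- abbr[key[hit]]
    paste(words, collapse = " ")
  }, character(1))
}

expand_words(c('123 Harrison ST', '456 Circle dr.', '7 Elm Way'), abbr)
# [1] "123 Harrison Street" "456 Circle Drive"    "7 Elm Way"
```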
Phoenix

You can use stri_replace_all_regex from the stringi package. Strip the trailing periods first; with vectorize_all=FALSE, each pattern is applied in turn to the whole vector.

stringi::stri_replace_all_regex(str=sub('\\.$', '', df$address),
                                pattern=c("St$", "Dr$", "Blvd$", "Ln$"), 
                                replacement=c("Street", "Drive", "Boulevard", "Lane"),
                                vectorize_all=FALSE)
# [1] "123 Harrison Street" "456 Circle Drive"    "345 Round Boulevard"

Or, an outer approach without any packages (abbr is the named vector from the question):

df$address <- sub('\\.$', '', df$address)
o <- outer(names(abbr), df$address, Vectorize(grepl)) |> apply(2, which)
Vectorize(gsub)(paste(names(abbr), collapse='|'), abbr[unlist(replace(o, lengths(o) == 0, NA))], df$address) |> unname()
# [1] "123 Harrison Street" "456 Circle Drive"    "345 Round Boulevard"
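
If the outer construction is hard to follow, the same "one pattern at a time" idea can be sketched with Reduce, which feeds the result of each sub call into the next. This is a simplified alternative under the same assumptions as above (trailing periods stripped first, abbreviation at the end of the string), not a drop-in equivalent of the code in this answer.

```r
abbr <- c('St' = 'Street', 'Dr' = 'Drive', 'Blvd' = 'Boulevard', 'Ln' = 'Lane')
address <- sub('\\.$', '', c('123 Harrison St', '456 Circle Dr.', '345 Round Blvd'))

# Apply one sub() per abbreviation, passing each result on to the next key
expanded <- Reduce(
  function(x, key) sub(paste0("\\b", key, "$"), abbr[[key]], x, perl = TRUE),
  names(abbr),
  address
)
expanded
# [1] "123 Harrison Street" "456 Circle Drive"    "345 Round Boulevard"
```

Keys that do not occur in the data, such as 'Ln' here, leave the vector unchanged on their pass.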
jay.sf