1

I'm hoping someone can help me figure out how to modify one variable multiple times in data.table, or find a similar approach that would work for big data.

I have a dataset with strings (addresses to be exact, but the exact contents aren't important), such as:

library(data.table)
library(stringr)

# example addresses although you can imagine other types of strings here
addr <- data.table(street = c('1 main street', 
                              '99 madison avenue',
                              '340 circle court'))

I have another dataset with a column of patterns that I want to search for patterns in these strings (i.e. in the addr dataset) and substitute with other strings kept in another column in this second dataset. For example:

# example of patterns to search for and what I want to replace them with
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
                     abbrev = c('st', 'ave', 'cir', 'ct')) 

The actual datasets are much larger: millions of addresses and 300+ abbreviations I want to check each address for.

It'd be fairly simple to do this in a loop, but because of the size, I'd like to use data.table and probably an apply function to make this process more efficient.

I'm struggling to figure out how to write this exactly. I want something like the following:

# duplicate addresses so we can compare to changes
addr[, orig.street := street]

# function to substitute abbreviations we want
standardize <- function(word, shorter) {
  addr[, street := str_replace_all(street,
                                   paste0(" ", word),
                                   paste0(" ", shorter))]
}

# now run function for all abbreviations we want
addr[, street := mapply(FUN = standardize,
                        word = abbrev$full,
                        shorter = abbrev$abbrev,
                        SIMPLIFY = FALSE, USE.NAMES = FALSE)]

When I run this in Rstudio, this is returning the error, "Supplied 4 items to be assigned to 3 items of column 'street'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code."

However it actually does give me what I want, despite the error:

# it breaks but I do get the desired outcome:
           street       orig.street
1:      1 main st     1 main street
2: 99 madison ave 99 madison avenue
3:     340 cir ct  340 circle court

I feel like there must be a solution I'm missing, but I haven't figured it out. Any help would be greatly appreciated.

dmcd
  • 123
  • 7

2 Answers2

2

You could use stri_replace_all_fixed along with it's argument vectorize_all = FALSE from library(stringi):

library(data.table)
library(stringi)

addr <- data.table(orig_street = c('1 main street', 
                              '99 madison avenue',
                              '340 circle court'))

abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
                     abbrev = c('st', 'ave', 'cir', 'ct')) 

addr[, street := stri_replace_all_fixed(orig_street, abbrev$full, abbrev$abbrev, vectorize_all = FALSE)]

> addr
         orig_street         street
1:     1 main street      1 main st
2: 99 madison avenue 99 madison ave
3:  340 circle court     340 cir ct

Please also see this related answer and note that library(stringr) imports library(stringi).

ismirsehregal
  • 30,045
  • 5
  • 31
  • 78
1

An alternative is a Reduce method:

addr[, street2 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\b"), abbrev$abbrev[i], txt),
                         seq_len(nrow(abbrev)), init = street)][]
#               street        street2
#               <char>         <char>
# 1:     1 main street      1 main st
# 2: 99 madison avenue 99 madison ave
# 3:  340 circle court     340 cir ct

Note:

  • I explicitly add word-boundaries (\\b) to the gsub regex so that we don't inadvertently replace a portion of a word. I think we need this instead of fixed=TRUE because gsub("court", "ct", "courteous", fixed = TRUE) returns "cteous".
  • If we tried an apply family (on abbrev), then we would see the updated value for each of the patterns, but not know (without extra work) which one had the change; further, if it's possible (in general, perhaps not here) for more than one abbreviation pattern to be useful, then we need to apply each pattern/replacement on the results of the previous replacement, which *apply cannot do (as easily).
  • Unfortunately, Reduce does not easily iterate over rows of a frame, so we iterate over row indices (seq_len(nrow(abbrev))).

However, I can't help but feel that the last row should really be "340 circle ct". In which case, if we assume that the abbrev is at the end of the string, we can use that instead:

addr[, street3 := Reduce(function(txt, i) gsub(paste0("\\b", abbrev$full[i], "\\s*$"), abbrev$abbrev[i], txt), 
                         seq_len(nrow(abbrev)), init  = street)][]
#               street        street2        street3
#               <char>         <char>         <char>
# 1:     1 main street      1 main st      1 main st
# 2: 99 madison avenue 99 madison ave 99 madison ave
# 3:  340 circle court     340 cir ct  340 circle ct
r2evans
  • 141,215
  • 6
  • 77
  • 149