I'm hoping someone can help me figure out how to modify one variable multiple times in data.table
, or find a similar approach that would work for big data.
I have a dataset with strings (addresses to be exact, but the exact contents aren't important), such as:
library(data.table)
library(stringr)
# example addresses although you can imagine other types of strings here
addr <- data.table(street = c('1 main street',
'99 madison avenue',
'340 circle court'))
I have another dataset with a column of patterns that I want to search for patterns in these strings (i.e. in the addr
dataset) and substitute with other strings kept in another column in this second dataset. For example:
# example of patterns to search for and what I want to replace them with
abbrev <- data.table(full = c('street', 'avenue', 'circle', 'court'),
abbrev = c('st', 'ave', 'cir', 'ct'))
The actual datasets are much larger: millions of addresses and 300+ abbreviations I want to check each address for.
It'd be fairly simple to do this in a loop, but because of the size, I'd like to use data.table
and probably an apply
function to make this process more efficient.
I'm struggling to figure out how to write this exactly. I want something like the following:
# duplicate addresses so we can compare to changes
addr[, orig.street := street]
# function to substitute abbreviations we want
standardize <- function(word, shorter) {
addr[, street := str_replace_all(street,
paste0(" ", word),
paste0(" ", shorter))]
}
# now run function for all abbreviations we want
addr[, street := mapply(FUN = standardize,
word = abbrev$full,
shorter = abbrev$abbrev,
SIMPLIFY = FALSE, USE.NAMES = FALSE)]
When I run this in Rstudio, this is returning the error, "Supplied 4 items to be assigned to 3 items of column 'street'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code."
However it actually does give me what I want, despite the error:
# it breaks but I do get the desired outcome:
street orig.street
1: 1 main st 1 main street
2: 99 madison ave 99 madison avenue
3: 340 cir ct 340 circle court
I feel like there must be a solution I'm missing, but I haven't figured it out. Any help would be greatly appreciated.