Match by argument pattern and left join

Question

I am trying to match city/county names (fortunately information on state names are provided) with their corresponding state names and then append their first three phone digit as another column using left_join(). My initial thought would be to replicate the city/county name column and then replacing them with their state names using sapply() along with grep(), then using left_join() to merge it with phone digit column, but it doesn't seem that my code works.

library(dplyr)

location <- data.frame(location = c('Asortia, New York', 'Buffalo, New York', 'New York, New York',  'Alexandra, Virginia', 'Fairfax, Virginia', 'Baltimore, Maryland', 'Springfield, Maryland'), number = c(100, 200, 300, 400, 500, 600, 700))

state <- data.frame(state = c('New York', 'Virginia', 'Maryland'))

sapply(as.character(state$state), function(i) grep(i, location$location))

### doesn't work! ###
### my desired output would be ###

  location number
1 New York    100
2 New York    200
3 New York    300
4 Virginia    400
5 Virginia    500
6 Maryland    600
7 Maryland    700

Such that I could use left_join to merge the output generated from above with their three digit phone number. For example,

df <- location
names(df)[1] <- 'state'
digit <- data.frame(state = c('New York', 'Virginia', 'Maryland'), digit = c(212, 703, 410))
   
new_df <- left_join(df, digit, by = 'state')

### the desired output ###

  location number digit
1 New York    100   212
2 New York    200   212
3 New York    300   212
4 Virginia    400   703
5 Virginia    500   703
6 Maryland    600   410
7 Maryland    700   410

I have referenced this and this thread, but didn't quite get the clue. Hope someone could help me on this.

## Update

I found that using grepl in a for loop also works, but the processing may be slow if you have large amount of data (the data I'm working on has two million observations).

for (i in state$state) { 
location$location[grepl(i, location$location)] <- i
}

akrun · Accepted Answer · 2019-11-26T18:57:02.530

2

May be we can use str_remove by pasteing (str_c) the pattern vector in 'state' column from 'state' dataset as a regex lookaround to match anything that precedes the vector (to remove)

library(stringr)
library(dplyr)
location %>%
    mutate(location = str_remove(location, str_c(".*(?=(",
            str_c(state$state, collapse  = "|"), "))")))
#  location number
#1 New York    100
#2 New York    200
#3 New York    300
#4 Virginia    400
#5 Virginia    500
#6 Maryland    600
#7 Maryland    700

Or another option is to separate into two column and remove the first

library(tidyr)
location %>%
   separate(location, into = c('unwanted', 'location'), sep=",\\s*") %>% 
   select(-unwanted)

Or if we have a specific pattern, remove the prefix part by matching one or more characters that are not a , from the start (^) followed by , and zero or more spaces (\\s*) as pattern in the str_remove

location %>% 
    mutate(location = str_remove(location, '^[^,]+,\\s*'))

edited Nov 26 '19 at 18:57

answered Nov 26 '19 at 18:48

akrun

874,273
37
540
662

Hi, thanks for your reply. I actually want to remove the city/county names but retain the state names (and without the ","). – Chris T. Nov 26 '19 at 18:50
1

Thanks a lot, this really helps. – Chris T. Nov 26 '19 at 18:56
Sorry to hear that, but your answer really helps. – Chris T. Nov 26 '19 at 19:19
I found that using `grepl` in a `for loop` to replace original entries also works, but the processing may be slow if you have large amount of data. – Chris T. Nov 27 '19 at 01:33
@ChrisT. You can replace with `stri_detect` from `stringi` – akrun Nov 27 '19 at 17:54

score 1 · Answer 2 · answered Nov 26 '19 at 19:09

You can use str_match, map, and unite:

library(tidyverse)

location$state <- map_df(state, ~str_match(location$location, .x)) %>% 
                  unite("state", na.rm=T) %>% 
                  pull()

left_join(location, digit, by = "state") %>% 
  select(state, number, digit)

     state number digit
1 New York    100   212
2 New York    200   212
3 New York    300   212
4 Virginia    400   703
5 Virginia    500   703
6 Maryland    600   410
7 Maryland    700   410

Match by argument pattern and left join

2 Answers2

Linked