Combine factor levels with fuzzy/partial character string match in R

Question

I have a data set that includes DMA (Designated Market Area), but many of the DMAs show up as two different levels because the DMA is truncated, e.g. the DMA "Abilene-Sweetwater, TX" sometimes shows up as "Abilene-Sweetw"

Here is a snippet of the data set:

dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX", 
         "Albany, GA", "Albany, GA", 
         "Albany-Schenec", "Albany-Schenec", 
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)

DMA.df <- data.frame(dma, cost)

DMA.df
dma cost
1              Abilene-Sweetw 0.46
2      Abilene-Sweetwater, TX 0.46
3                  Albany, GA 0.45
4                  Albany, GA 0.45
5              Albany-Schenec 0.32
6              Albany-Schenec 0.32
7 Albany-Schenectady-Troy, NY 0.32
8 Albany-Schenectady-Troy, NY 0.32

Searches on SO and elsewhere turn up solutions that show how to manually combine multiple factor levels into one. Obviously I don't want to manually do this.

I am looking for a way to fix the truncated DMA and turn it into the "full" DMA (city-...-, state). One saving grace is that there is a pattern to the truncation - it cuts off at 14 letters. The solution would need to match all 14 characters because many DMAs start with the same name (e.g. "Albany, GA" and "Albany-..., NY").

Put another way, I need to find all the truncated DMAs that match the complete DMAs and turn the truncated DMA into the complete DMA.

Sample DF should look like this:

             dma cost
1      Abilene-Sweetwater, TX 0.46
2      Abilene-Sweetwater, TX 0.46
3                  Albany, GA 0.45
4                  Albany, GA 0.45
5 Albany-Schenectady-Troy, NY 0.32
6 Albany-Schenectady-Troy, NY 0.32
7 Albany-Schenectady-Troy, NY 0.32
8 Albany-Schenectady-Troy, NY 0.32

Thanks in advance for any suggestions.

`DMA.df$dma2 <- substring(DMF.df$dma, 1, 14)` I think you just need to truncate everything... then you've found your matches, right? — cory, Nov 07 '19 at 19:47
@cory - Thanks for the suggestion. Truncating all the DMAs to 14 letters is easy enough, but I want to take the truncated DMAs and convert them to the "**full**" DMA. Just to clarify, this is a sample DF, with the full DF containing over 100M rows with 210 DMAs (and it appears that all the long ones are duplicated with a truncated match). — James, Nov 07 '19 at 20:24

score 0 · Answer 1 · answered Nov 12 '19 at 18:24

The simplest solution using Base r:: substring and merge and dplyr::select and mutate:

#sample (and problematic) df with some DMAs truncated and others full-length
dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX", 
         "Albany, GA", "Albany, GA", 
         "Albany-Schenec", "Albany-Schenec", 
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)


DMA.df <- data.frame(dma, cost, stringsAsFactors = FALSE)
                         dma cost
1              Abilene-Sweetw 0.46
2      Abilene-Sweetwater, TX 0.46
3                  Albany, GA 0.45
4                  Albany, GA 0.45
5              Albany-Schenec 0.32
6              Albany-Schenec 0.32
7 Albany-Schenectady-Troy, NY 0.32
8 Albany-Schenectady-Troy, NY 0.32

#create a column where ALL the DMAs are truncated to the same length
DMA.df <- DMA.df %>% 
  mutate(dma_truncated = substring(dma, 1, 13)) %>% 
  select(-dma) #drop the orginal 'DMA' column
cost dma_truncated
1 0.46 Abilene-Sweet
2 0.46 Abilene-Sweet
3 0.45    Albany, GA
4 0.45    Albany, GA
5 0.32 Albany-Schene
6 0.32 Albany-Schene
7 0.32 Albany-Schene
8 0.32 Albany-Schene

#Create a lookup table where the truncated DMA is paired with the full DMA
dma_master <- c("Abilene-Sweetwater, TX",  
                "Albany, GA", 
                "Albany-Schenectady-Troy, NY")
dma_truncated <- substring(dma_master, 1, 13)
DMA_lookup.df <- data.frame(dma_truncated, dma_master, stringsAsFactors = FALSE)

dma_truncated                  dma_master
1 Abilene-Sweet      Abilene-Sweetwater, TX
2    Albany, GA                  Albany, GA
3 Albany-Schene Albany-Schenectady-Troy, NY


#Use MERGE to create the desired column of 'DMA' in the original DF
full_DMA.df <- merge(DMA_lookup.df, DMA.df, by='dma_truncated') %>% 
  select(-dma_truncated) #drop the truncated DMA column

dma_master cost
1      Abilene-Sweetwater, TX 0.46
2      Abilene-Sweetwater, TX 0.46
3 Albany-Schenectady-Troy, NY 0.32
4 Albany-Schenectady-Troy, NY 0.32
5 Albany-Schenectady-Troy, NY 0.32
6 Albany-Schenectady-Troy, NY 0.32
7                  Albany, GA 0.45
8                  Albany, GA 0.45

This is the SO post that basically solved my problem: How to do vlookup and fill down (like in Excel) in R?

score 0 · Answer 2 · answered Apr 17 '20 at 12:08

I've posted a function on github xfactor that uses regex matching to change factor levels and can accomplish the above. Install using devtools::install_github("jwilliman/xfactor"). The levels argument contains the desired regex expressions (truncated DMA) and the labels expression the desired output (full DMA codes).


library(xfactor)

dma <- c("Abilene-Sweetw", "Abilene-Sweetwater, TX", 
         "Albany, GA", "Albany, GA", 
         "Albany-Schenec", "Albany-Schenec", 
         "Albany-Schenectady-Troy, NY", "Albany-Schenectady-Troy, NY")
cost <- c(0.46, 0.46, 0.45, 0.45, 0.32, 0.32, 0.32, 0.32)

DMA.df <- data.frame(dma, cost)


 within(DMA.df, {
   dma = xfactor::xfactor(
     dma, 
     levels = c("Abilene", "Albany, GA", "Albany-Schenec"),
     labels = c("Abilene-Sweetwater, TX", "Albany, GA", "Albany-Schenectady-Troy, NY")
   )
 })
#>                           dma cost
#> 1      Abilene-Sweetwater, TX 0.46
#> 2      Abilene-Sweetwater, TX 0.46
#> 3                  Albany, GA 0.45
#> 4                  Albany, GA 0.45
#> 5 Albany-Schenectady-Troy, NY 0.32
#> 6 Albany-Schenectady-Troy, NY 0.32
#> 7 Albany-Schenectady-Troy, NY 0.32
#> 8 Albany-Schenectady-Troy, NY 0.32

^{Created on 2020-04-18 by the reprex package (v0.3.0)}

Combine factor levels with fuzzy/partial character string match in R

2 Answers2