1

I have 2 dataframe: DF1

ID   Address
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR

and, df2

    Name
    CHHAPPAR
    CHHAPAR
    KANSAL
    KANSIL
    KANSOL
    KHERK
    KHERKIA
    PAR
    UR
   WAR
   RIYA
   DAV
   LI

I want to apply fuzzy logic in DF1 string. If the names given in DF1 matches with DF2, give me the DF2 name

Output should be like

ID   Address                                                                 Name
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR                                         CHHAPPAR, CHHAPAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR                                   KANSAL, KANSIL, KANSOL
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR                           KHERK, KHERKIA

I tried applying FuzzywuzzyR but it's given an error

I tried agrep too, but it's giving me result as True/False.

Please help me out in this. Also, if I should try other packages for fuzzy

1 Answers1

2

I would use the package fuzzyjoin for this, which works with the logic from tidytext:

library(tidytext)
library(fuzzyjoin)
library(tidyverse)

df1 %>% 
  unnest_tokens(word, Address, to_lower = FALSE) %>% 
  fuzzyjoin::stringdist_left_join(df2, by = c("word" = "Name"), max_dist = 1) %>% 
  group_by(ID) %>% # collapse unnested tokens back to text if you want
  summarise(text = paste(word, collapse = " "),
            Name = toString(na.omit(Name)))
#> # A tibble: 4 x 3
#>   ID    text                                                 Name               
#>   <chr> <chr>                                                <chr>              
#> 1 AB1   VILL PO CHAPAR TAPUKADA ALWAR                        "CHHAPAR"          
#> 2 AB2   VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POS~ ""                 
#> 3 AB3   RAMKUMAR YADAV VILL KANSL KANSL KANSL 0 JAIPUR       "KANSAL, KANSIL, K~
#> 4 AB4   VILL KHERKI KHERKI MUKKER POSTPANIYA PUTLI JAIPUR    "KHERK, KHERKIA"

data

df1 <- read.csv(text = "ID,Address
AB1,VILL +PO CHAPAR TAPUKADA  ALWAR
AB2,VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3,RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4,VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR", stringsAsFactors = FALSE)

df2 <- read.csv(text = "Name
CHHAPPAR
CHHAPAR
KANSAL
KANSIL
KANSOL
KHERK
KHERKIA", stringsAsFactors = FALSE)
JBGruber
  • 11,727
  • 1
  • 23
  • 45
  • I am getting an Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1. – SiddhiKharkia Apr 22 '20 at 14:42
  • Is the column a factor maybe? In that case use `df1$Address <- as.character(df1$Address)`. – JBGruber Apr 22 '20 at 14:47
  • I am still getting the error. My address is converted like : R O O RADHYSHYAM SEN SEN SEN SEN SEN SEN R O O DEV DEV DEV DEV DEV DEV KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA KHERA PO CHAPREL CHAPREL CHAPREL CHAPREL BHILWARA BHILWARA BHILWARA BHILWARA BHILWARA BHILWARA and, the names df : OR, OR, OD, SEU, SEH, REN, SENA, BEN, VEN, OR, OR, OD, DEH, DEU, DAV, DER, DEI, DET, KHERI, KHARA, KHORA, KHERLA, KHERA, KHERKA, KHEDA, GHERA, KHEMRA, JHERA, KHERAT, KHERWA, KHER, AHERA, KHERAN, KHERDA, KHEJRA, KHERAD – SiddhiKharkia Apr 22 '20 at 16:00
  • Sorry, no idea. It works with your example data as demonstrated. Maybe you need to share a chunk of your actual data with `dput()`. Update it in the question and let me know. – JBGruber Apr 22 '20 at 16:06