which R function to use for Text Auto-Correction?

Question

I have a csv Document with 2 columns which contains Commodity Category and Commodity Name.

Ex:

Sl.No. Commodity Category Commodity Name
1      Stationary         Pencil
2      Stationary         Pen
3      Stationary         Marker
4      Office Utensils    Chair
5      Office Utensils    Drawer
6      Hardware           Monitor
7      Hardware           CPU

and I have another csv file which contains various Commodity names.

Ex:

Sl.No. Commodity Name
1      Pancil
2      Pencil-HB 02
3      Pencil-Apsara
4      Pancil-Nataraj
5      Pen-Parker
6      Pen-Reynolds
7      Monitor-X001RL

The output I would like is to standardise and categorise the commodity names and classify them into respective Commodity Categories like shown below :

Sl.No. Commodity Name   Commodity Category
1      Pencil           Stationary
2      Pencil           Stationary
3      Pencil           Stationary
4      Pancil           Stationary
5      Pen              Stationary
6      Pen              Stationary
7      Monitor          Hardware

Step 1) I first have to use NLTK (Text mining methods) and clean the data so as to seperate "Pencil" from "Pencil-HB 02" .

Step 2) After cleaning I have to use Approximate String match technique i.e agrep() to match the patterns "Pencil *" or correcting "Pancil" to "Pencil".

Step 3)Once correcting the pattern I have to categorise. No idea how.

This is what I have thought about. I started with step 2 and I'm stuck in step 2 only. I'm not finding an exact method to code this. Is there any way to get the output as required? If yes please suggest me the method I can proceed with.

Randy Lai · Accepted Answer · 2015-04-23T07:53:17.210

You could use the stringdist package. The correct function below will correct the Commodity.Name in file2 based on distances of the item to different CName.

Then a left_join is used to join the two tables.

I also notice that there are some classifications if I use the default options for stringdistmatrix. You can try changing the weight argument of stringdistmatrix for better correction result.

> library(dplyr)
> library(stringdist)
> 
> file1 <- read.csv("/Users/Randy/Desktop/file1.csv")
> file2 <- read.csv("/Users/Randy/Desktop/file2.csv")
> 
> head(file1)
  Sl.No. Commodity.Category Commodity.Name
1      1         Stationary         Pencil
2      2         Stationary            Pen
3      3         Stationary         Marker
4      4    Office Utensils          Chair
5      5    Office Utensils         Drawer
6      6           Hardware        Monitor
> head(file2)
  Sl.No. Commodity.Name
1      1         Pancil
2      2   Pencil-HB 02
3      3  Pencil-Apsara
4      4 Pancil-Nataraj
5      5     Pen-Parker
6      6   Pen-Reynolds
> 
> CName <- levels(file1$Commodity.Name)
> correct <- function(x){
+     factor(sapply(x, function(z) CName[which.min(stringdistmatrix(z, CName, weight=c(1,0.1,1,1)))]), CName)
+ }
> 
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
> 
> correctedfile2 %>%
+ inner_join(file1[,-1], by="Commodity.Name")
  Commodity.Name.Old Commodity.Name Commodity.Category
1             Pancil         Pencil         Stationary
2       Pencil-HB 02         Pencil         Stationary
3      Pencil-Apsara         Pencil         Stationary
4     Pancil-Nataraj         Pencil         Stationary
5         Pen-Parker            Pen         Stationary
6       Pen-Reynolds            Pen         Stationary
7     Monitor-X001RL        Monitor           Hardware

If you need the "Others" category, you just need to play with the weights. I added a row "Diesel" in file2. Then compute the score using stringdist with customized weights (you should try varying the values). If the score is large than 2 (this value is related to how the weights are assigned), it doesn't correct anything.

PS: as we don't know all the possible labels, we have to do as.character to convect factor to character.

PS2: I am also using tolower for case insensitive scoring.

> head(file2)
  Sl.No. Commodity.Name
1      1         Diesel
2      2         Pancil
3      3   Pencil-HB 02
4      4  Pencil-Apsara
5      5 Pancil-Nataraj
6      6     Pen-Parker
> 
> CName <- levels(file1$Commodity.Name)
> CName.lower <- tolower(CName)
> correct_1 <- function(x){
+     scores = stringdistmatrix(tolower(x), CName.lower, weight=c(1,0.001,1,0.5))
+     if (min(scores)>2) {
+         return(x)
+     } else {
+         return(as.character(CName[which.min(scores)]))
+     }
+ }
> correct <- function(x) {
+     sapply(as.character(x), correct_1)
+ }
> 
> correctedfile2 <- file2 %>%
+ transmute(Commodity.Name.Old = Commodity.Name, Commodity.Name = correct(Commodity.Name))
> 
> file1$Commodity.Name = as.character(file1$Commodity.Name)
> correctedfile2 %>%
+ left_join(file1[,-1], by="Commodity.Name")
  Commodity.Name.Old Commodity.Name Commodity.Category
1             Diesel         Diesel               <NA>
2             Pancil         Pencil         Stationary
3       Pencil-HB 02         Pencil         Stationary
4      Pencil-Apsara         Pencil         Stationary
5     Pancil-Nataraj         Pencil         Stationary
6         Pen-Parker            Pen         Stationary
7       Pen-Reynolds            Pen         Stationary
8     Monitor-X001RL        Monitor           Hardware

Thanks for your reply Randy. Can you suggest me a method to handle unmatched cases? For Ex: if i have 'Petrol' or 'Diesel' or 'Truck' as the commodity names in file 2, it should come under Others category. So for that i cannot add a long list of commodity names in file 1. It would be greatful if you could suggest me a method to handle this case. Also in agrep we have max.distance which will tell me the % of match. Is there anything as such in StringDist?.. — Viamia, Apr 23 '15 at 06:57
Thanks Randy. Since I don't have the latest version of R I'm facing issues in installing the stringDist package. I'll try executing using the method you've suggested and get back. Thanks a lot again. — Viamia, Apr 23 '15 at 11:23
Everything is working Fine like you have explained. Great Code. Thanks a lot Randy!.. — Viamia, Apr 28 '15 at 09:46
Hi Randy, I want to design a front end for the above classification in HTML. Instead of providing file 2 as excel,I want to enter the commodity name in the HTML document and submit. But I'm confused how to send that form data to R code so that it checks with file 1. I have used R Markdown and knitr package for this and got stuck. Please let me know if it's possible to proceed as i have thought. — Viamia, May 20 '15 at 07:05
hi, ive tried this method with different arrangement and different number of commodity name but it is failed.. any suggestion ? — SongJL, Jun 09 '22 at 07:01

Tamas Nagy · Answer 2 · 2018-01-10T09:36:08.453

There is an 'Approximate string matching' function amatch() in {stingdist} (at least in 0.9.4.6) that returns the most probable match from the pre-defined set of words. It has a parameter maxDist that can be set for the maximum distance to be matched, and a nomatch parameter that can be used for the 'other' category. Otherwise, method, weights, etc. can be set similarly to stringdistmatrix().

So, your original problem can be solved like this using a tidyverse compatible solution:

library(dplyr)
library(stringdist)

# Reading the files
file1 <- readr::read_csv("file1.csv")
file2 <- readr::read_csv("file2.csv")

# Getting the commodity names in a vector    
commodities <- file1 %>% distinct(`Commodity Name`) %>% pull()

# Finding the closest string match of the commodities, and joining the file containing the categories 
file2 %>%
    mutate(`Commodity Name` = commodities[amatch(`Commodity Name`, commodities, maxDist = 5)]) %>%
    left_join(file1, by = "Commodity Name")

This will return a data frame that contains the corrected commodity name and category. If the original Commodity name is more than 5 characters away (simplified explanation of string distance) from any of the possible commodity names, the corrected name will be NA.

which R function to use for Text Auto-Correction?

2 Answers2