
I need to populate VAR1_STRUCTURED without manually entering all possible VAR1 values, since I have 50,000 observations, which means 50,000 possible cases.

Var1                   Var1_Structured
125 Hollywood St.      125 Hollywood St.
125 Hllywood St.       125 Hollywood St.
125 Hollywood St       125 Hollywood St.
Target Store           Target Store
Trget Stre             Target Store
Target. Store          Target Store
T argetStore           Target Store
Walmart                Walmart
Walmart Inc.           Walmart
Wal marte              Walmart

and there are many more values below...

– mnms2

  • Welcome to SO. Please read [What topics can I ask about](http://stackoverflow.com/help/on-topic), [How to ask a good question](http://stackoverflow.com/help/how-to-ask), [the perfect question](http://codeblog.jonskeet.uk/2010/08/29/writing-the-perfect-question/), and how to create a [Minimal, Complete and Verifiable example](http://stackoverflow.com/help/mcve). SO is **not a free coding, code conversion, debugging, tutorial or library-finding service**. You also have to show that you have made some effort to solve your own problem. – RiggsFolly Oct 23 '16 at 14:13
  • One possibility: Levenshtein distance. – Gordon Linoff Oct 23 '16 at 14:25
  • This might be a start: http://stackoverflow.com/questions/38956241/remove-duplicates-from-list-based-on-semantic-similarity-relatedness/38960345#38960345 – Steven Beaupré Oct 23 '16 at 15:37

2 Answers


Your question is really imprecise. Please follow @RiggsFolly's suggestions and read the references on how to ask a good question.

Also, as suggested by @DuduMarkovitz, you should start by simplifying the problem and cleaning your data (unifying case, removing punctuation, collapsing extra whitespace, and so on).
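
As an illustration, a minimal cleaning pass in base R could look like this (the exact rules are assumptions and would need tuning against your real data):

# Hypothetical first pass: unify case, strip punctuation and
# collapse stray whitespace before computing any string distances
clean <- tolower(df$Var1)
clean <- gsub("[[:punct:]]", "", clean)    # "Target. Store" -> "target store"
clean <- gsub("\\s+", " ", trimws(clean))  # collapse runs of spaces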

Once you are satisfied with the results, you could then proceed to identify a group for each Var1 entry (this will help you down the road to perform further analysis/manipulations on similar entries). This could be done in many different ways, but as mentioned by @GordonLinoff, one possibility is the Levenshtein distance.

Note: for 50K entries, the result won't be 100% accurate as it will not always categorize the terms in the appropriate group but this should considerably reduce manual efforts.

In R, you could do this using adist():

Compute the approximate string distance between character vectors. The distance is a generalized Levenshtein (edit) distance, giving the minimal possibly weighted number of insertions, deletions and substitutions needed to transform one string into another.

Using your example data:

d <- adist(df$Var1)
# add rownames (this will prove useful later on)
rownames(d) <- df$Var1

> d
#                  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#125 Hollywood St.    0    1    1   16   15   16   15   15   15    15
#125 Hllywood St.     1    0    2   15   14   15   15   14   14    14
#125 Hollywood St     1    2    0   15   15   15   14   14   15    15
#Target Store        16   15   15    0    2    1    2   10   10     9
#Trget Stre          15   14   15    2    0    3    4    9   10     8
#Target. Store       16   15   15    1    3    0    3   11   11    10
#T argetStore        15   15   14    2    4    3    0   10   11     9
#Walmart             15   14   14   10    9   11   10    0    5     2
#Walmart Inc.        15   14   15   10   10   11   11    5    0     6
#Wal marte           15   14   15    9    8   10    9    2    6     0

For this small sample, you can see the 3 distinct groups (the clusters of low Levenshtein distance values) and could easily assign them manually, but for larger sets, you will likely need a clustering algorithm.

I already pointed you in the comments to one of my previous answers showing how to do this using hclust() and Ward's minimum variance method, but I think here you would be better off using other techniques (one of my favorite resources on the topic, for a quick overview of some of the most widely used methods in R, is this detailed answer).
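
For reference, a minimal sketch of that hclust() approach, assuming you already know you want k = 3 groups (in practice you would inspect the dendrogram first):

# Ward clustering on the Levenshtein distance matrix computed above
hc <- hclust(as.dist(d), method = "ward.D2")
groups <- cutree(hc, k = 3)  # k = 3 is an assumption for this toy sample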

Here's an example using affinity propagation clustering:

library(apcluster)
d_ap <- apcluster(negDistMat(r = 1), d)

You will find in the APResult object d_ap the elements associated with each cluster, and the optimum number of clusters, in this case 3.

> d_ap@clusters
#[[1]]
#125 Hollywood St.  125 Hllywood St.  125 Hollywood St 
#                1                 2                 3 
#
#[[2]]
# Target Store    Trget Stre Target. Store  T argetStore 
#            4             5             6             7 
#
#[[3]]
#     Walmart Walmart Inc.    Wal marte 
#           8            9           10 
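
If you need a flat group label per entry rather than the list of clusters, one way to derive it (a sketch relying only on the @clusters slot shown above) is:

# Map each Var1 value to the index of the cluster that contains it
groups <- rep(seq_along(d_ap@clusters), lengths(d_ap@clusters))
names(groups) <- unlist(lapply(d_ap@clusters, names))
groups["Trget Stre"]  # returns 2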

You can also see a visual representation:

> heatmap(d_ap, margins = c(10, 10))

[heatmap of the clustered distance matrix]

Then, you can perform further manipulations for each group. As an example, here I use hunspell to look up each separate word from Var1 in an en_US dictionary for spelling mistakes and try to find, within each group, which id has no spelling mistakes (potential_id).

library(dplyr)
library(tidyr)
library(hunspell)

tibble(Var1 = sapply(d_ap@clusters, names)) %>%
  unnest(.id = "group") %>%                 # one row per entry, plus its cluster index
  group_by(group) %>%
  mutate(id = row_number()) %>%             # id of the entry within its group
  separate_rows(Var1) %>%                   # split each entry into individual words
  mutate(check = hunspell_check(Var1)) %>%  # spell-check every word
  group_by(id, add = TRUE) %>%
  summarise(checked_vars = toString(Var1), 
            result_per_word = toString(check), 
            potential_id = all(check))      # TRUE only if every word passes

Which gives:

#Source: local data frame [10 x 5]
#Groups: group [?]
#
#   group    id        checked_vars   result_per_word potential_id
#   <int> <int>               <chr>             <chr>        <lgl>
#1      1     1 125, Hollywood, St.  TRUE, TRUE, TRUE         TRUE
#2      1     2  125, Hllywood, St. TRUE, FALSE, TRUE        FALSE
#3      1     3  125, Hollywood, St  TRUE, TRUE, TRUE         TRUE
#4      2     1       Target, Store        TRUE, TRUE         TRUE
#5      2     2         Trget, Stre      FALSE, FALSE        FALSE
#6      2     3      Target., Store        TRUE, TRUE         TRUE
#7      2     4       T, argetStore       TRUE, FALSE        FALSE
#8      3     1             Walmart             FALSE        FALSE
#9      3     2       Walmart, Inc.       FALSE, TRUE        FALSE
#10     3     3          Wal, marte      FALSE, FALSE        FALSE

Note: since we haven't performed any text processing here, the results are not very conclusive, but you get the idea.
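
Another option: apcluster also records an exemplar, the member it considers most representative, for each cluster, and those could serve as the Var1_Structured values, if you are willing to assume the exemplar is the correctly spelled variant:

# Sketch: map every entry to its cluster's exemplar. This assumes
# d_ap@exemplars[i] is the exemplar of d_ap@clusters[[i]], which is
# how apcluster orders them.
canonical <- rep(names(d_ap@exemplars), lengths(d_ap@clusters))
names(canonical) <- unlist(lapply(d_ap@clusters, names))
df$Var1_Structured <- unname(canonical[df$Var1])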


Data

df <- tibble::tribble(
  ~Var1,                   
  "125 Hollywood St.",      
  "125 Hllywood St.",       
  "125 Hollywood St",       
  "Target Store",           
  "Trget Stre",             
  "Target. Store",          
  "T argetStore",           
  "Walmart",                
  "Walmart Inc.",           
  "Wal marte" 
)
– Steven Beaupré

Having 50K values does not mean you have to manually review 50K values. Start with an aggregation on the values, which will most likely reduce the scale of the problem. As a second step, normalize the values: you can remove signs (e.g. spaces and periods), common words (e.g. "St." and "Inc."), and so on, and normalize the addresses without the numbers.
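
A rough sketch of that idea (the normalization rules here are hypothetical and would need tuning):

# Step 1: aggregate - 50K rows usually collapse to far fewer distinct values
counts <- sort(table(df$Var1), decreasing = TRUE)

# Step 2: normalize - lower-case, drop punctuation and common filler words
norm <- tolower(names(counts))
norm <- gsub("[[:punct:]]", "", norm)
norm <- gsub("\\b(st|inc)\\b", "", norm)   # hypothetical stop-word list
norm <- gsub("\\s+", " ", trimws(norm))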

– David דודו Markovitz