Find and impute the centroid of words (strings)

Question

Suppose I have the following data frame of car brands. How can I find the centroid of each brand (word) and impute that centroid to the most "similar" words, so that I get a second column, pal_ok, with the normalized brand names?

db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"))

        pal1
1       fiat
2       fiat
3       fiat
4     fiat 1
5      fiatt
6       fait
7      fiaat
8    renault
9    renault
10   renault
11  renaultt
12 renault 3
13  renaultc
14   remault

db <- data.frame(pal1 = c("fiat","fiat","fiat","fiat 1","fiatt","fait","fiaat","renault","renault","renault","renaultt","renault 3","renaultc","remault"),
               pal_ok  =c("fiat","fiat","fiat","fiat","fiat","fiat","fiat","renault","renault","renault","renault","renault","renault","renault"))

        pal1  pal_ok
1       fiat    fiat
2       fiat    fiat
3       fiat    fiat
4     fiat 1    fiat
5      fiatt    fiat
6       fait    fiat
7      fiaat    fiat
8    renault renault
9    renault renault
10   renault renault
11  renaultt renault
12 renault 3 renault
13  renaultc renault
14   remault renault

The centroid would be the most frequent word (in this case, fiat and renault). — lolo, Dec 20 '18 at 01:28
Maybe you need stemming of words. This could be helpful https://stackoverflow.com/questions/24443388/stemming-with-r-text-analysis — Ronak Shah, Dec 20 '18 at 01:41

s__ · Accepted Answer · 2018-12-20T12:40:13.840

You can try this with adist() from the utils package (shipped with R) and a dplyr chain:

library(dplyr)

# compute the "centroids", taken here to mean the most common words
pal <- as.data.frame.table(table(db$pal1)) %>%   # frequency table as a data frame
       arrange(desc(Freq)) %>%                   # most frequent first
       top_n(2, Freq)                            # keep the top 2; choose n to suit your data

 pal
     Var1 Freq
1    fiat    3
2 renault    3
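If hard-coding the number of tops is inconvenient, one alternative is a frequency cutoff. This is only a sketch in base R: the cutoff of 2 and the names words, freq and centroids are assumptions for illustration, not something the question specifies.

```r
# the words from the question, as a plain character vector
words <- c("fiat", "fiat", "fiat", "fiat 1", "fiatt", "fait", "fiaat",
           "renault", "renault", "renault", "renaultt", "renault 3",
           "renaultc", "remault")

freq <- table(words)                  # how often each spelling occurs
centroids <- names(freq)[freq >= 2]   # hypothetical cutoff: seen at least twice
centroids
```

With this data only fiat and renault occur more than once, so the result matches the top_n(2) approach. A different data set would need a different cutoff.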

Now we can calculate the distance between each "centroid" and every word:

# edit (Levenshtein) distance between each word and each centroid
dist <- data.frame(adist(db$pal1,pal$Var1))

# rename the columns, in this case with only two brands
colnames(dist) <- c('fiat','renault')

 dist
   fiat renault
1     0       5
2     0       5
3     0       5
4     2       6
5     1       5
6     2       5
7     1       5
8     5       0
9     5       0
10    5       0
11    6       1
12    7       2
13    6       1
14    5       1
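To see where these numbers come from, it may help to call adist() on single pairs. As far as I know, with default costs every insertion, deletion and substitution counts as 1, and swapping two adjacent letters is not a single edit. The variable names below are just for illustration.

```r
# adist() lives in utils, which is attached by default
d_extra   <- adist("fiatt", "fiat")       # one deletion
d_swap    <- adist("fait", "fiat")        # two substitutions, not one swap
d_replace <- adist("remault", "renault")  # one substitution (m -> n)

c(d_extra, d_swap, d_replace)
```

These agree with rows 5, 6 and 14 of the table above, and they explain why "fait" is further from "fiat" than "fiatt" is.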

Now we can choose the one with the smallest distance:

cbind(db, dist) %>%                                                 # bind data and distances
  mutate(pal_calc = ifelse(fiat < renault, 'fiat', 'renault')) %>%  # pick the closer centroid
  select(-c(fiat, renault))                                         # drop the distance columns

        pal1  pal_ok pal_calc
1       fiat    fiat     fiat
2       fiat    fiat     fiat
3       fiat    fiat     fiat
4     fiat 1    fiat     fiat
5      fiatt    fiat     fiat
6       fait    fiat     fiat
7      fiaat    fiat     fiat
8    renault renault  renault
9    renault renault  renault
10   renault renault  renault
11  renaultt renault  renault
12 renault 3 renault  renault
13  renaultc renault  renault
14   remault renault  renault
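The ifelse() above only works for exactly two brands. With more centroids, one option is to take the column with the smallest distance in each row. This is a minimal sketch in base R, assuming the centroids are already known. The names centroids and d are mine, and ties go to whichever centroid comes first.

```r
db <- data.frame(
  pal1 = c("fiat", "fiat", "fiat", "fiat 1", "fiatt", "fait", "fiaat",
           "renault", "renault", "renault", "renaultt", "renault 3",
           "renaultc", "remault"),
  stringsAsFactors = FALSE
)

centroids <- c("fiat", "renault")   # assumed known; can be any length

d <- adist(db$pal1, centroids)      # rows = words, columns = centroids
# index of the nearest centroid per row; which.min() returns the first minimum
db$pal_calc <- centroids[apply(d, 1, which.min)]
db
```

This avoids renaming the distance columns by hand, and adding a third brand only means adding it to centroids.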
