Add category column to a data set

Question

I have a data table like this:

+------------+-------+
|  Model     | Price | 
+------------+-------+
|  Apple-1   |   10  |
+------------+-------+
|  New Apple |   11  |
+------------+-------+
|  Orange    |   13  |
+------------+-------+
|  Orange2019|   15  |
+------------+-------+
|  Cat       |   19  |
+------------+-------+

I want to define a list of base model tags and add the matching tag to every row whose Model matches a certain condition or value. For example, I defined a data frame for tagging like this:

+------------+--------+
|  Model     |   Tag  |
+------------+--------+
|  Apple-1   |   A    |
+------------+--------+
|  New Apple |   A    |
+------------+--------+
|  Orange    |   B    |
+------------+--------+
|  Cat       |   B    |
+------------+--------+

I would like to find some way to get this result:

+------------+-------+--------+
|  Model     | Price |  Tag   |
+------------+-------+--------+
|  Apple-1   |   10  |   A    |
+------------+-------+--------+
|  New Apple |   11  |   A    |
+------------+-------+--------+
|  Orange    |   13  |   B    |
+------------+-------+--------+
|  Orange2019|   15  |   B    |
+------------+-------+--------+
|  Cat       |   19  |   B    |
+------------+-------+--------+

I don't mind using a table to manage the tagging data, and I know that I could write a very "ad-hoc" mutate statement to get the result I want. I'm just wondering if there is a more elegant way to tag a string based on a pattern match.

boski · Accepted Answer · 2019-02-20T16:40:35.023

One idea is to use Levenshtein distances to cluster the words you have. You would need to provide the number of clusters. Once you have the clusters, just add the number of each one as a category tag to your table. Check out this answer, which goes into detail on Levenshtein distance clustering: Text clustering with Levenshtein distances
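A minimal sketch of that idea in base R, assuming the model names from the question and an arbitrarily chosen k = 2. The variable names are made up for illustration, and whether the resulting clusters line up with the tags you actually want depends entirely on the strings:

```r
# Hypothetical example data: the Model column from the question
models <- c("Apple-1", "New Apple", "Orange", "Orange2019", "Cat")

# Pairwise Levenshtein (generalized edit) distances, base R
d <- adist(models)
rownames(d) <- models

# Hierarchical clustering on the distance matrix
hc <- hclust(as.dist(d))

# k = 2 is an assumed number of clusters; it has to be chosen by hand
cluster <- cutree(hc, k = 2)

# Cluster number next to each model name, usable as a category column
data.frame(Model = models, Cluster = cluster, row.names = NULL)
```

The cluster numbers are only labels, so you would still have to map them to meaningful tags yourself.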

edit

I think I totally misunderstood your question. Try this:

library(dplyr)
library(stringr)

df <- data.frame("Model" = c("Apple-1", "New Apple", "Organe", "Orange2019", "Cat"),
                 "Price" = c(10, 11, 13, 15, 19), stringsAsFactors = FALSE)
tags <- data.frame("Model" = c("Apple-1", "New Apple", "Orange", "Cat"),
                   "Tag" = c("A", "A", "B", "B"), stringsAsFactors = FALSE)

# For each row, take the tag of the first pattern in tags$Model found in Model, or "None" if nothing matches
df %>% rowwise() %>% mutate(Tag = if_else(!is.na(tags$Tag[which(!is.na(str_extract(Model, tags$Model)))[1]]),
                                          tags$Tag[which(!is.na(str_extract(Model, tags$Model)))[1]], false = "None"))

  Model      Price Tag  
  <chr>      <dbl> <chr>
1 Apple-1       10 A    
2 New Apple     11 A    
3 Organe        13 None 
4 Orange2019    15 B    
5 Cat           19 B

I actually changed Orange to Organe so that you can see what happens when there is no match ("None" is returned).

Thanks a lot, it works really well, even if I don't fully understand the magic inside :-). I've figured out the basic idea behind it, though. It's just a very slow process: on my data set the total computation time is over 5 minutes :-) - Ilproff_77, Feb 21 '19 at 21:15
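On the speed point: the rowwise() version runs the pattern match once per row. A possible alternative, sketched here in base R and not timed on real data, is to loop over the (short) tags table instead and let grepl() test the whole Model column at once. One difference to be aware of: this sketch assumes the tag patterns should be matched as literal substrings (fixed = TRUE), whereas str_extract() treats them as regular expressions.

```r
df <- data.frame(Model = c("Apple-1", "New Apple", "Organe", "Orange2019", "Cat"),
                 Price = c(10, 11, 13, 15, 19), stringsAsFactors = FALSE)
tags <- data.frame(Model = c("Apple-1", "New Apple", "Orange", "Cat"),
                   Tag = c("A", "A", "B", "B"), stringsAsFactors = FALSE)

# Start with no tag assigned to any row
tag <- rep(NA_character_, nrow(df))

# One vectorized grepl() call per tag pattern; rows that already
# have a tag are skipped, so the first matching pattern wins
for (i in seq_len(nrow(tags))) {
  hit <- is.na(tag) & grepl(tags$Model[i], df$Model, fixed = TRUE)
  tag[hit] <- tags$Tag[i]
}

# Rows that matched no pattern get "None", as in the answer above
tag[is.na(tag)] <- "None"
df$Tag <- tag
df
```

On the example data this assigns the same tags as the output shown above (A, A, None, B, B). The number of grepl() calls grows with the number of tag patterns rather than with the number of rows, which is why it may help when the tags table is much smaller than the data.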
