-1

I have a dataframe with a bunch of names that are similar, but not identical. These names belong to the same band/location, and I need a way to renaming all the similar ones to a single name, so I can group my values.

Ex:

Names:   

HOT WATER BAND  
Hot Water Band  
HOT WATER BAND FROM SEATTLE   
Ho t Water Band FROM seATTLE  
HOT WATER  
ho t wat er BAND from  
BAND WATER HOT SEATTLE

Is there an easy way to clean and take all of these and rename them to a new variables for example "Hot Water Band From Seattle"? so that way it's easy to group all of my data afterwards

Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
Dasax121
  • 23
  • 8
  • What is the issue, exactly? Have you tried anything, done any research? Stack Overflow is not a free code writing service. See: [tour], [ask], [help/on-topic], https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users. – AMC Feb 19 '20 at 02:57

1 Answers1

2

You could use the Levenshtein distance between the two strings, and if below a certain threshold convert the value to the target.

Example:

import pandas as pd, numpy as np
import Levenshtein

df['names'] = df['Names'].str.lower().str.strip() # normalize names
df['distance'] = df.apply(lambda x: Levenshtein.distance(x['Names'],
                                               df.groupby(['names'])['names'].\
                                                               value_counts().\
                                                               idxmax()[0]),axis=1)
df['converted'] = df.apply(lambda x: x['Names'].strip() 
                                     if x['distance'] <= 5 
                                     else np.nan,axis=1)  

Result which you could then group by:

In [165]: df['converted']
Out[165]: 
0               NaN
1    Hot Water Band
2               NaN
3               NaN
4               NaN
5               NaN
6               NaN
Name: converted, dtype: object
mechanical_meat
  • 163,903
  • 24
  • 228
  • 223
  • 1
    I think to have a fully systemic solution here - it would be probably better to not define the common key beforehand, but deduce it by grouping sentences accordingly to Levenshtein distances (or any other measure of similarity) between them. Which is not necessarily what OP wanted, yet that would generalize a bit this problem. – Grzegorz Skibinski Feb 18 '20 at 19:27
  • what would be the easiest way to use this if i have 100+ bands that all have similar names but not the exact names, like the example above? with my end goal being able to group the data by those names – Dasax121 Feb 18 '20 at 19:31
  • @Alex121: I generalized the solution a bit. Let me know if you find it useful. – mechanical_meat Feb 19 '20 at 01:56
  • 1
    @ VorsprungdurchTechnik thank you it did. I was thinking of a less pythonic version of doing this and I came up with the conclusion of writing a function that loops through the 'Names' columns and then using str.contains() adds the new name to a new empty column if it contains keywords like in this example 'water band'. Would you recommend this solution as well or stick with Levenshtein and experiment with fuzzywuzzy? – Dasax121 Feb 19 '20 at 15:25
  • Hey, if it works for you go with it. I personally would try out fuzzywuzzy, too. Best of luck. – mechanical_meat Feb 19 '20 at 15:34