How to group data frame with similar text in python

Question

I have a data frame DF like this:

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})

I need to group it by Code and Description. Grouping by Code is simple:

GR = DF.groupby('Code')

Now I want to continue with grouping by Description, so all values that are equal or similar (have a common part) are grouped together. Can you help me with a formula to get something like this:

It might be there are two questions: 'equal values' and 'similar values'. If there is any hint on at least the 'equal values', that would be great.

for the 'equal values' have you tried pivot table like [this](https://stackoverflow.com/questions/35414625/pandas-how-to-run-a-pivot-with-a-multi-index)? — Amri Rasyidi, Apr 24 '21 at 08:59
Are there many other "equal" strings? Other than ```ABC String```? — Serge de Gosson de Varennes, Apr 24 '21 at 09:41
Thank you Amri, I will have a look. Yes Serge, there are other strings in the original sets. — Rafal Xxx, Apr 24 '21 at 15:47

score 4 · Answer 1 · answered Apr 24 '21 at 11:17

You could also use fuzzywuzzy to compute the Levensthein distance, even if there are more than two "similar" values

For example

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def', 'def', 'def', 'abc'],  
               'Description':['ABC String', 'ABC String', 
                              'ABC String and sth', 'Only sth else', 
                              'ABC String', 'CDEFGH', 'CDEFGH and sth', 
                              'CDEFGH and sth',],
               'Value':[10, 20, 30, 40, 50, 60, 70, 80]}) 

# for each unique value in Description
for d in DF.Description.unique():
    # compute Levensthein distance
    # and set to True if >= a limit
    # (you may have to play around with it)
    DF[d] = DF['Description'].apply(
        lambda x : fuzz.ratio(x, d) >= 60
    )
    # set a name for the group
    # here, simply the shortest
    m = np.min(DF[DF[d]==True].Description)
    # assign the group
    DF.loc[DF.Description==d, 'group'] = m

print(DF)

  Code         Description  Value  ABC String          group  \
0  abc          ABC String     10        True     ABC String   
1  abc          ABC String     20        True     ABC String   
2  abc  ABC String and sth     30        True     ABC String   
3  abc       Only sth else     40       False  Only sth else   
4  def          ABC String     50        True     ABC String   
5  def              CDEFGH     60       False         CDEFGH   
6  def      CDEFGH and sth     70       False         CDEFGH   
7  abc      CDEFGH and sth     80       False         CDEFGH   

   ABC String and sth  Only sth else  CDEFGH  CDEFGH and sth  
0                True          False   False           False  
1                True          False   False           False  
2                True          False   False           False  
3               False           True   False           False  
4                True          False   False           False  
5               False          False    True            True  
6               False          False    True            True  
7               False          False    True            True

And now you can groupby the created groups

DF.groupby('group').Value.mean()

group
ABC String       27.5
CDEFGH           70.0
Only sth else    40.0
Name: Value, dtype: float64

score 3 · Accepted Answer · answered Apr 24 '21 at 10:45

To check for similar strings you can use jellyfish.levenshtein_distance. Idea is to iterate over each group and grab the most frequent element from the group, then evaluate the levenshtein_distance relative to that most frequent element. If the distance is close to 0 that means the given string is similar and vice-versa.

# from difflib import SequenceMatcher
from statistics import mode
import jellyfish

import pandas as pd

df = pd.DataFrame({'Code': ['abc', 'abc', 'abc', 'abc', 'def'],
                   'Description': ['ABC String', 'abc string', 'ABC String and sth', 'Only sth else', 'ABC String'],
                   'Value': [10, 20, 30, 40, 100]})

df_list = []
for grp,df in df.groupby('Code'):
    df['distance'] = df['Description'].apply(lambda x : jellyfish.levenshtein_distance(x, mode(df['Description'])))
    df['Description'] =  mode(df['Description'])
    df_list.append(df[df['distance'] < 10])

df = pd.concat(df_list).drop('distance', axis=1)
print(df)

output -

  Code Description  Value
0  abc  ABC String     10
1  abc  ABC String     20
2  abc  ABC String     30
4  def  ABC String    100

For more better and accurate analysis - convert the string to lower case remove spaces and punctuation and then follow the algorithm.

Thank you a lot @Nk03. This is interesting approach which I will study more to accommodate to my final solution. — Rafal Xxx, Apr 24 '21 at 16:37

score 0 · Answer 3 · answered Apr 24 '21 at 10:31

this really depends on how you define similar.... if you say first 10 characters you can use string slicing.

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})  

DF.groupby(["Code", DF.Description.str[:10]])["Value"].sum()

score 0 · Answer 4 · answered Apr 24 '21 at 11:37

Try to add boolean column with function and then apply:

val = 'ABC String'

df['boo'] = df['description'].apply(lambda x: 1 if x.find(val)>=0 else 0)
df

code	description	value	boo
abc	ABC String	10	1
abc	ABC String	20	1
abc	ABC String and sth	30	1
abc	Only sth else	40	0
def	ABC String	100	1

and then some wrangling

df = df[df.boo == 1]
df = df.iloc[:,:-1]
df

code	description	value
abc	ABC String	10
abc	ABC String	20
abc	ABC String and sth	30
def	ABC String	100

How to group data frame with similar text in python

4 Answers4