3

I have a data frame DF like this:

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})  

enter image description here

I need to group it by Code and Description. Grouping by Code is simple:

GR = DF.groupby('Code')

enter image description here

Now I want to continue with grouping by Description, so all values that are equal or similar (have a common part) are grouped together. Can you help me with a formula to get something like this:

enter image description here

It might be there are two questions: 'equal values' and 'similar values'. If there is any hint on at least the 'equal values', that would be great.

Rafal Xxx
  • 65
  • 1
  • 7

4 Answers4

4

You could also use fuzzywuzzy to compute the Levensthein distance, even if there are more than two "similar" values

For example

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def', 'def', 'def', 'abc'],  
               'Description':['ABC String', 'ABC String', 
                              'ABC String and sth', 'Only sth else', 
                              'ABC String', 'CDEFGH', 'CDEFGH and sth', 
                              'CDEFGH and sth',],
               'Value':[10, 20, 30, 40, 50, 60, 70, 80]}) 

# for each unique value in Description
for d in DF.Description.unique():
    # compute Levensthein distance
    # and set to True if >= a limit
    # (you may have to play around with it)
    DF[d] = DF['Description'].apply(
        lambda x : fuzz.ratio(x, d) >= 60
    )
    # set a name for the group
    # here, simply the shortest
    m = np.min(DF[DF[d]==True].Description)
    # assign the group
    DF.loc[DF.Description==d, 'group'] = m

print(DF)

  Code         Description  Value  ABC String          group  \
0  abc          ABC String     10        True     ABC String   
1  abc          ABC String     20        True     ABC String   
2  abc  ABC String and sth     30        True     ABC String   
3  abc       Only sth else     40       False  Only sth else   
4  def          ABC String     50        True     ABC String   
5  def              CDEFGH     60       False         CDEFGH   
6  def      CDEFGH and sth     70       False         CDEFGH   
7  abc      CDEFGH and sth     80       False         CDEFGH   

   ABC String and sth  Only sth else  CDEFGH  CDEFGH and sth  
0                True          False   False           False  
1                True          False   False           False  
2                True          False   False           False  
3               False           True   False           False  
4                True          False   False           False  
5               False          False    True            True  
6               False          False    True            True  
7               False          False    True            True  

And now you can groupby the created groups

DF.groupby('group').Value.mean()

group
ABC String       27.5
CDEFGH           70.0
Only sth else    40.0
Name: Value, dtype: float64
Max Pierini
  • 2,027
  • 11
  • 17
3

To check for similar strings you can use jellyfish.levenshtein_distance. Idea is to iterate over each group and grab the most frequent element from the group, then evaluate the levenshtein_distance relative to that most frequent element. If the distance is close to 0 that means the given string is similar and vice-versa.

# from difflib import SequenceMatcher
from statistics import mode
import jellyfish

import pandas as pd

df = pd.DataFrame({'Code': ['abc', 'abc', 'abc', 'abc', 'def'],
                   'Description': ['ABC String', 'abc string', 'ABC String and sth', 'Only sth else', 'ABC String'],
                   'Value': [10, 20, 30, 40, 100]})

df_list = []
for grp,df in df.groupby('Code'):
    df['distance'] = df['Description'].apply(lambda x : jellyfish.levenshtein_distance(x, mode(df['Description'])))
    df['Description'] =  mode(df['Description'])
    df_list.append(df[df['distance'] < 10])

df = pd.concat(df_list).drop('distance', axis=1)
print(df)
    

output -

  Code Description  Value
0  abc  ABC String     10
1  abc  ABC String     20
2  abc  ABC String     30
4  def  ABC String    100

For more better and accurate analysis - convert the string to lower case remove spaces and punctuation and then follow the algorithm.

Nk03
  • 14,699
  • 2
  • 8
  • 22
  • Thank you a lot @Nk03. This is interesting approach which I will study more to accommodate to my final solution. – Rafal Xxx Apr 24 '21 at 16:37
0

this really depends on how you define similar.... if you say first 10 characters you can use string slicing.

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})  

DF.groupby(["Code", DF.Description.str[:10]])["Value"].sum()
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
0

Try to add boolean column with function and then apply:

val = 'ABC String'

df['boo'] = df['description'].apply(lambda x: 1 if x.find(val)>=0 else 0)
df
code description value boo
abc ABC String 10 1
abc ABC String 20 1
abc ABC String and sth 30 1
abc Only sth else 40 0
def ABC String 100 1

and then some wrangling

df = df[df.boo == 1]
df = df.iloc[:,:-1]
df
code description value
abc ABC String 10
abc ABC String 20
abc ABC String and sth 30
def ABC String 100
AdrianSK
  • 55
  • 5