I'm trying to find the strings in two lists that almost match. Suppose there are two lists as below:

string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']

string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']

Expected output:
Similar = ['apple_from_2018','samsung_from_2017','htc_from_2015','lenovo_decommision_2017']
Not Similar = ['nokia_from_2010','moto_from_2019']

I tried the implementation below, but it does not give the expected result:

from difflib import SequenceMatcher

similar = []
not_similar = []
for item1 in string_list_1:
    for item2 in string_list_2:
        if SequenceMatcher(a=item1, b=item2).ratio() > 0.90:
            similar.append(item1)
        else:
            not_similar.append(item1)

With this implementation the result is not as expected. I would appreciate it if someone could identify what is missing so that I get the required result.

Aaditya R Krishnan

1 Answer

You can use the following function to measure the similarity between two strings:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


print(similar("apple_from_2018", "apple_from_2020"))

Output :

0.8666666666666667

Using this function, you can select the strings whose similarity ratio exceeds a threshold. Note that this pair scores about 0.87, so you will need to lower your threshold from 0.90 to something like 0.85 or 0.80 to get the expected output.
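To pick a threshold, it may help to look at the best score each string in the first list reaches against the second list. The snippet below is only a sketch; `best_score` is a helper name made up for this illustration and is not part of `difflib`.

```python
from difflib import SequenceMatcher

string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']


def best_score(item, candidates):
    """Hypothetical helper: highest ratio between item and any candidate."""
    return max(SequenceMatcher(None, item, c).ratio() for c in candidates)


# Print one line per string so the gap between matches and non-matches is visible
for item in string_list_1:
    print(f"{item:<26}{best_score(item, string_list_2):.3f}")
```

Any threshold that falls between the lowest score you want to accept and the highest score you want to reject will separate the two groups.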

The following code should therefore work for you:

from difflib import SequenceMatcher

string_list_1 = ['apple_from_2018','samsung_from_2017','htc_from_2015','nokia_from_2010','moto_from_2019','lenovo_decommision_2017']

string_list_2 = ['apple_from_2020','samsung_from_2021','htc_from_2015','lenovo_decommision_2017']

similar = []
not_similar = []
for item1 in string_list_1:
    # Becomes True as soon as item1 is close enough to any string in list 2
    found = False
    for item2 in string_list_2:
        if SequenceMatcher(None, item1, item2).ratio() > 0.80:
            similar.append(item1)
            found = True
            break  # one match is enough; skip the remaining comparisons

    # Append to not_similar only once, after every candidate has been checked
    if not found:
        not_similar.append(item1)

print("Similar : ", similar)
print("Not Similar : ", not_similar)

Output :

Similar :  ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015', 'lenovo_decommision_2017']
Not Similar :  ['nokia_from_2010', 'moto_from_2019']

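Breaking out of the inner loop on the first match avoids unnecessary comparisons, and the found flag ensures each item is appended exactly once instead of once per comparison. I have also lowered the threshold to 0.80, since 0.90 is too high for the apple pair, which scores about 0.87. Feel free to tweak the value.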
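If the flag-and-break pattern feels verbose, the same logic can be written with the built-in `any()`, which also stops at the first match. This is just a sketch of one possible layout; the function name `split_by_similarity` and its `threshold` parameter are invented for this example.

```python
from difflib import SequenceMatcher


def split_by_similarity(items, candidates, threshold=0.80):
    """Hypothetical helper: partition items by whether any candidate is close."""
    similar, not_similar = [], []
    for item in items:
        # any() short-circuits, so it stops comparing after the first match
        has_match = any(
            SequenceMatcher(None, item, candidate).ratio() > threshold
            for candidate in candidates
        )
        (similar if has_match else not_similar).append(item)
    return similar, not_similar


string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']

similar, not_similar = split_by_similarity(string_list_1, string_list_2)
print("Similar : ", similar)
print("Not Similar : ", not_similar)
```

The number of comparisons is the same as in the explicit nested loop; only the layout changes.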

Tanishq Vyas
  • May I know whether it is possible to avoid the nested loop with a better coding format? – Aaditya R Krishnan Dec 21 '20 at 06:54
  • Just a clarification: you wish to select all the strings from list 1 that match 90% or more with any one of the strings in list 2. Is that interpretation correct? – Tanishq Vyas Dec 21 '20 at 06:57
  • Yes Tanisha. But is it possible to improve the coding format? – Aaditya R Krishnan Dec 21 '20 at 06:59
  • You need nested loops here, since every possible pair that might satisfy your condition has to be checked. However, you can use the break keyword to leave the inner loop and move on to the next item once a match has been found. Also, the code in your question appends the same word to the mismatch list repeatedly, once for every comparison where the similarity is below 0.9. So make sure to break out of the loop appropriately to reduce the time taken and improve the solution. But the nesting is mandatory. It's Tanishq* : ) – Tanishq Vyas Dec 21 '20 at 07:02
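On the question of avoiding the explicit inner loop: the standard library's `difflib.get_close_matches` can do the inner comparison for you. It still checks the word against every candidate internally, so this changes the layout rather than the amount of work. Note that it keeps candidates whose ratio is greater than or equal to `cutoff`, whereas the code above uses a strict `>`. A minimal sketch, assuming the same two lists and a cutoff of 0.80:

```python
from difflib import get_close_matches

string_list_1 = ['apple_from_2018', 'samsung_from_2017', 'htc_from_2015',
                 'nokia_from_2010', 'moto_from_2019', 'lenovo_decommision_2017']
string_list_2 = ['apple_from_2020', 'samsung_from_2021', 'htc_from_2015',
                 'lenovo_decommision_2017']

similar, not_similar = [], []
for item in string_list_1:
    # n=1: one close match is enough to classify the item as similar
    matches = get_close_matches(item, string_list_2, n=1, cutoff=0.80)
    (similar if matches else not_similar).append(item)

print("Similar : ", similar)
print("Not Similar : ", not_similar)
```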