Is there a way to use pandas to calculate the string similarity between each row and the previous row in a column?

Row 1: Businesses Pte Ltd

Row 2: Business Pvt Ltd

Row 3: Global Pvt Ltd

It should compare Row 1 and Row 2 and come up with a similarity percentage. If the similarity is about 90%, Row 2 should be replaced with the Row 1 value, and so on for the following rows.

Result

Row 1: Businesses Pte Ltd

Row 2: Businesses Pte Ltd

Row 3: Global Pvt Ltd

newtoCS
  • Can you provide a definition for "percentage of similarity"? – jpp Mar 06 '18 at 09:22
  • It can be based on the number of characters, i.e. how many characters differ from the previous row. – newtoCS Mar 06 '18 at 09:26
  • That's interesting. I'm afraid SO isn't the best place to design the *logic* for you (but see links in @Matthew's answer for ideas). You will find many people here who are willing to take your logic and transfer it to code in an efficient way. – jpp Mar 06 '18 at 09:40

1 Answer

This is a surprisingly tricky problem. Presumably you sorted the rows alphabetically first, but what happens if the typo is in the first letter? After sorting, "Businesses Pte Ltd" ends up a long way from "Vusinesses Pte Ltd", so the two rows would never be compared.

Still - to solve your problem you want to combine these two solutions:

Find the similarity percent between two strings

Comparing previous row values in Pandas DataFrame

It should get you something workable.
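As a rough sketch of how the two ideas could be combined: the code below uses `difflib.SequenceMatcher` from the standard library as the similarity measure, which is only one possible definition of "percentage of similarity". The helper names `similarity` and `replace_near_duplicates`, the 0.85 threshold, and the choice to compare each row against the previous row after replacement are all assumptions made for illustration, not something specified in the question.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a score between 0 and 1 using difflib's ratio (an assumed metric)."""
    return SequenceMatcher(None, a, b).ratio()


def replace_near_duplicates(values, threshold=0.85):
    """Replace a value with the previous (already cleaned) value when similar enough.

    Assumption: each row is compared with the previous row *after* replacement,
    so a run of near-duplicates collapses to the first value in the run.
    """
    result = []
    for value in values:
        if result and similarity(result[-1], value) >= threshold:
            result.append(result[-1])  # close enough: reuse the previous value
        else:
            result.append(value)  # first row, or too different: keep as is
    return result


names = ["Businesses Pte Ltd", "Business Pvt Ltd", "Global Pvt Ltd"]
cleaned = replace_near_duplicates(names)
print(cleaned)
```

With this metric the first two example rows score about 0.88, so a strict 0.9 cut-off would leave them untouched; the threshold probably needs tuning on real data. Because the function accepts any iterable of strings, it should also work on a column, for example `df["name"] = replace_near_duplicates(df["name"])`, where `df` and `"name"` are placeholder names.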

Matthew