6

I am using pandas dataframes and I have data where I have customers per company. However, the company titles vary slightly but ultimately affect the data. Example:

Company    Customers
AAAB       1,000
AAAB Inc.  900
The AAAB Inc.  20
AAAB the INC   10

I want to get the total customers out of a data base of several different companies with the companies having non-standard names. Any idea where I should start?

jww
  • 97,681
  • 90
  • 411
  • 885
Alexis_Kiwis
  • 101
  • 2
  • 6
  • 1
    related: http://stackoverflow.com/q/13636848/1240268, I think best place to start is have the actual list of companies and try and work out a good metric/function for similar names. – Andy Hayden Oct 17 '13 at 21:48
  • I checked out that link and saw a great use to the get_close_matches. However, I had some trouble finding a proper execution. I used #Spend['Manufacturer'] = Spend['Manufacturer'].map(lambda x: difflib.get_close_matches(x,Spend['SKU/Catalog #'],5)) I also tried the below def spendd(n): manuf = [] for i in n: Spend['Manufacturer'].map(lambda i: difflib.get_close_matches(i,Spend['Manufacturer'],5)) return manuf spendd('Manufacturer') TypeError: object of type 'float' has no len() I am trying to fix this error but am unsure how. – Alexis_Kiwis Oct 17 '13 at 22:19
  • trigram similarity works wonders. I usually use a tool like cleanco and then use trigram similarity in PostgreSQL or a similarity match in Snowflake [both will work potentially faster than Python in many cases due to the cost of retrieving and storing the data, especially Snowflake] – Andrew Scott Evans Jul 27 '21 at 04:08

3 Answers3

10

I remember reading this blog about the fuzzywuzzy library (looking into another question), which can do this:

pip install fuzzywuzzy

You can use its partial_ratio function to "fuzzy match" the strings:

In [11]: from fuzzywuzzy.fuzz import partial_ratio

In [12]: partial_ratio('AAAB', 'the AAAB inc.')
Out[12]: 100

Which seems confident about it being a good match!

In [13]: partial_ratio('AAAB', 'AAPL')
Out[13]: 50

In [14]: partial_ratio('AAAB', 'Google')
Out[14]: 0

We can take the best match in the actual company list (assuming you have it):

In [15]: co_list = ['AAAB', 'AAPL', 'GOOG']

In [16]: df.Company.apply(lambda mistyped_co: max(co_list, 
                                                  key=lambda co: partial_ratio(mistyped_co, co)))
Out[16]: 
0    AAAB
1    AAAB
2    AAAB
3    AAAB
Name: Company, dtype: object

I strongly suspect there is something in scikit learn or a numpy library to do this more efficiently on large datasets... but this should get the job done.

If you don't have the company list you'll probably have to do something more clevererer...

Community
  • 1
  • 1
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • Hey this is awesome. Thanks for exposing me to the fuzzywuzzy article. I will take some time to check it out and see how I can apply it. – Alexis_Kiwis Oct 17 '13 at 22:36
1
splitCompaniesSet = map( lambda cmpnyName : 
    set( map( lambda name : name.split(" "), cmpnyName ) ), dataFrame['Company'] )

I think that's right.

Basically create a list of sets, each set has the company name split. Then, starting with the first element, find the set intersection of every other element with that one. For every non-empty intersection, change the name to whatever the simplest match was among all the non-empty resulting sets, i.e. take one more set intersection with all the nonempty sets and set the result to be the company name for all those non-empty matches.

Then go on to the next Company that resulted in an empty set when intersected with the first company name. Then do this for the next Company that was empty for the first two you tried, and so on.

There's probably a more efficient way to do it, though.

Dharman
  • 30,962
  • 25
  • 85
  • 135
Matthew Turner
  • 3,564
  • 2
  • 20
  • 21
  • Interesting, I will give it a go. I have been looking up normalization on the unicodedata module. I was not sure if that would apply to this situation however. I also discovered from zope.component import getUtility from plone.i18n.normalizer.interfaces import IIDNormalizer from http://developer.plone.org/misc/normalizing_ids.html and am not sure how I could import those modules as well. It seemed I needed to download something else. – Alexis_Kiwis Oct 17 '13 at 21:52
  • Ok, I see. Instead of messing with sets, it looks like you could use `fuzzywuzzy.fuzz.partial_ratio` and set all the ones where `partial_ratio`=100 to be the same name. This would remove the need to have the company list as Andy suggested. – Matthew Turner Oct 17 '13 at 22:42
0

Use pip install cleanco

Try following code:-

from cleanco import prepare_terms, basename
business_name = "The AAAB Inc."
terms = prepare_terms()
basename(business_name, terms, prefix=False, middle=False, suffix=True)

Expected output:-

The AAAB
ChrisF
  • 134,786
  • 31
  • 255
  • 325
tuhinsharma121
  • 186
  • 2
  • 9