I am trying to use fuzzywuzzy to compare two columns in a dataframe. I want to know whether a string from one column ('Relationship') appears, even partially, in another ('CUST_NAME'), and then repeat the process for a second column ('Dealer_Name') against the same 'CUST_NAME' column.

Here is my dataframe:

RapDF1 = RapDF[['APP_KEY','Relationship','Dealer_Name','CUST_NAME']]

Here is the fuzzy matching:

from fuzzywuzzy import process, fuzz

RapDF1.assign(dealer_compare=[process.extract(i, RapDF1['Dealer_Name'], limit=3) for i in RapDF1['CUST_NAME']])
RapDF1.assign(broker_compare=[process.extract(i, RapDF1['Relationship'], limit=3) for i in RapDF1['CUST_NAME']])

However, I receive the following Python error:

TypeError                                 Traceback (most recent call last)
<ipython-input-76-2faf28514c26> in <module>()
     52 # Attempt 7
     53 
---> 54 RapDF1.assign(dealer_compare=[process.extract(i, RapDF1['Dealer_Name'], limit=3) for i in RapDF1['CUST_NAME']])
     55 RapDF1.assign(broker_compare=[process.extract(i, RapDF1['Relationship'], limit=3) for i in RapDF1['CUST_NAME']])
     56 

<ipython-input-76-2faf28514c26> in <listcomp>(.0)
     52 # Attempt 7
     53 
---> 54 RapDF1.assign(dealer_compare=[process.extract(i, RapDF1['Dealer_Name'], limit=3) for i in RapDF1['CUST_NAME']])
     55 RapDF1.assign(broker_compare=[process.extract(i, RapDF1['Relationship'], limit=3) for i in RapDF1['CUST_NAME']])
     56 

C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\process.py in extract(query, choices, processor, scorer, limit)
    166     """
    167     sl = extractWithoutOrder(query, choices, processor, scorer)
--> 168     return heapq.nlargest(limit, sl, key=lambda i: i[1]) if limit is not None else \
    169         sorted(sl, key=lambda i: i[1], reverse=True)
    170 

C:\ProgramData\Anaconda3\lib\heapq.py in nlargest(n, iterable, key)
    567     # General case, slowest method
    568     it = iter(iterable)
--> 569     result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)]
    570     if not result:
    571         return result

C:\ProgramData\Anaconda3\lib\heapq.py in <listcomp>(.0)
    567     # General case, slowest method
    568     it = iter(iterable)
--> 569     result = [(key(elem), i, elem) for i, elem in zip(range(0, -n, -1), it)]
    570     if not result:
    571         return result

C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\process.py in extractWithoutOrder(query, choices, processor, scorer, score_cutoff)
     76 
     77     # Run the processor on the input query.
---> 78     processed_query = processor(query)
     79 
     80     if len(processed_query) == 0:

C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\utils.py in full_process(s, force_ascii)
     93         s = asciidammit(s)
     94     # Keep only Letters and Numbers (see Unicode docs).
---> 95     string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
     96     # Force into lowercase.
     97     string_out = StringProcessor.to_lower_case(string_out)

C:\ProgramData\Anaconda3\lib\site-packages\fuzzywuzzy\string_processing.py in replace_non_letters_non_numbers_with_whitespace(cls, a_string)
     24         numbers with a single white space.
     25         """
---> 26         return cls.regex.sub(" ", a_string)
     27 
     28     strip = staticmethod(string.strip)

TypeError: expected string or bytes-like object
  • Hi, @NateO! Please provide an example of the data. I guess there are non-string entries in the dataset; you should check for them. Which version of Python are you using, 2.x or 3.x? – Mikhail Stepanov Jan 17 '19 at 06:47
  • Hello! I am using Python 3.6. All the fields contain only string values, but NaN values do exist where an entry was not provided. – NateO Jan 17 '19 at 13:21
  • What is the best way to provide data to help people out? – NateO Jan 17 '19 at 13:24
  • Print the dataframe to the terminal (a plain print, not the Jupyter notebook's HTML view) and post it as a code block. `NaN` has type `float` and may cause an error when searching with fuzzywuzzy; try dropping it or replacing it with empty strings or similar. I'll post this as an answer with a simplified example. – Mikhail Stepanov Jan 18 '19 at 07:47

1 Answer

There are probably NaN values in the dataframe. NaN has type float, and passing it to fuzzywuzzy causes the error:

from fuzzywuzzy import process, fuzz
import pandas as pd
import numpy as np

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df_nan = pd.DataFrame({'text1': ["quick", "brown", "fox"], "text2": ["hello", np.nan, "world"]})
df_nan
Out:
   text1  text2
0  quick  hello
1  brown    NaN
2    fox  world

Here is an example of code that raises the same error:

[process.extract(i, df_nan['text1'], limit=3) for i in df_nan['text2']]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
...
/usr/local/lib/python3.6/dist-packages/fuzzywuzzy/string_processing.py in replace_non_letters_non_numbers_with_whitespace(cls, a_string)
     24         numbers with a single white space.
     25         """
---> 26         return cls.regex.sub(" ", a_string)
     27 
     28     strip = staticmethod(string.strip)

TypeError: expected string or bytes-like object

Replace the NaNs with some token (choosing the right token is a hard, data-dependent task, and an empty string is probably a bad choice):

df = df_nan.fillna('##SOME_TOKEN##') 
[process.extract(i, df['text1'], limit=3) for i in df['text2']]
Out:
[[('fox', 36, 2), ('brown', 20, 1), ('quick', 0, 0)],
 [('brown', 36, 1), ('fox', 30, 2), ('quick', 18, 0)],
 [('fox', 30, 2), ('brown', 20, 1), ('quick', 0, 0)]]

I guess replacing or dropping all non-string values will help.
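To check whether non-string values are really the cause before choosing between replacing and dropping, something like the following might help. It is only a sketch on the toy `df_nan` frame from above; the names `is_string`, `string_mask`, `non_string_counts` and `df_clean` are made up for illustration, and it uses pandas alone, so the fuzzywuzzy call itself is left out.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the answer: one NaN hiding among the strings.
df_nan = pd.DataFrame({"text1": ["quick", "brown", "fox"],
                       "text2": ["hello", np.nan, "world"]})


def is_string(value):
    """Return True only for real str objects (NaN is a float, so False)."""
    return isinstance(value, str)


# Boolean frame: True where the cell holds a string.
string_mask = df_nan.apply(lambda column: column.map(is_string))

# Number of non-string cells per column, as plain Python ints.
non_string_counts = {col: int(n) for col, n in (~string_mask).sum().items()}
print(non_string_counts)

# Option: keep only the rows where every column holds a string.
df_clean = df_nan[string_mask.all(axis=1)].reset_index(drop=True)
print(df_clean)
```

Whether dropping rows is acceptable depends on the data. If every row has to stay in the result, the `fillna` approach above keeps the row count unchanged.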

Mikhail Stepanov
  • Thanks, Mikhail, I am trying that code out now! The reason for the error makes sense. It is running now, but it seems to be taking quite a while. I have about 90,000 records I am comparing. Is fuzzywuzzy resource-intensive or time-consuming depending on the number of records? My process could be stalling. – NateO Jan 18 '19 at 18:34
  • It has `O(n squared)` complexity, where `n` is the number of rows. For each pair of strings compared (I don't actually know which algorithm is used) I guess it's `O(k squared)`, where `k` is the length of the string, but it's at least `O(k)`. Also, Python isn't a fast language, so it takes time. Try running this code on a small fragment, e.g. a df of about 2000 entries, measure the execution time, and multiply it by `(90000 / 2000)^2` to get an approximate ETA. – Mikhail Stepanov Jan 18 '19 at 18:48
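
The extrapolation from the last comment can be written out as a small calculation. This is a rough sketch that assumes the quadratic scaling described above; the function name `estimate_full_runtime` is made up, and the 10-second sample timing is a hypothetical placeholder, not a measurement.

```python
def estimate_full_runtime(sample_seconds, sample_rows, total_rows):
    """Scale a measured sample time to the full data, assuming O(n^2) growth."""
    scale = (total_rows / sample_rows) ** 2
    return sample_seconds * scale


# Hypothetical: suppose the 2000-row sample took 10 seconds.
sample_seconds = 10.0
eta_seconds = estimate_full_runtime(sample_seconds, sample_rows=2000, total_rows=90000)

print(f"scale factor: {(90000 / 2000) ** 2:.0f}")
print(f"estimated total: {eta_seconds:.0f} s (about {eta_seconds / 3600:.1f} hours)")
```

The actual sample time would come from timing the list comprehension on the first 2000 rows, for example with `time.perf_counter()` before and after the call.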