I have a pandas dataframe called "df_combo" which contains the columns "worker_id", "url_entrance", and "company_name". I am trying to produce an output column that tells me whether the URL in the "url_entrance" column contains any word from the "company_name" column. Even a close match, like the score fuzzywuzzy produces, would work.

For example, if the URL is "www.grandhotelseattle.com" and the "company_name" is "Hotel Prestige Seattle", then the fuzz ratio might be somewhere around 70-80.

I have tried the following: `fuzz.ratio(df_combo['url_entrance'], df_combo['company_name'])`, but it returns only one number, which is the overall fuzz ratio for the whole column. I would like a fuzz ratio for every row, stored in a new column.

agg3l
Stanleyrr
  • Here is a possibly related question on SO: [create new column in dataframe using fuzzywuzzy](http://stackoverflow.com/questions/36138886/create-new-column-in-dataframe-using-fuzzywuzzy) – agg3l Oct 20 '16 at 00:45
  • Not exactly, but related. There the resulting table will have the length of the original table squared. (That's a dimension more than a column...) – kpie Oct 20 '16 at 02:12
  • @agg3l, I checked that link. Ran the scripts, but got an error saying: "TypeError: ("object of type 'float' has no len()", u'occurred at index 3206')". – Stanleyrr Oct 20 '16 at 02:41
  • Share df_combo.head() so that we visualize better your df and your issue – Zeugma Oct 20 '16 at 04:39

1 Answer

Thanks everyone for your input. I have solved my problem! The link that "agg3l" provided was helpful. The "TypeError" occurred because "url_entrance" or "company_name" contained float values in certain rows. I converted both columns to strings with the following script, re-ran the fuzz.ratio script, and got it to work!

    df_combo['url_entrance'] = df_combo['url_entrance'].astype(str)
    df_combo['company_name'] = df_combo['company_name'].astype(str)
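For anyone landing here later, this is a minimal sketch of how the whole thing might look end to end; it is not necessarily the exact script from the linked question. The sample rows, the `fuzz_ratio` column name, and the `ratio` helper are made up for illustration, and the `difflib` fallback is only a stand-in (an assumption) so the snippet runs without fuzzywuzzy installed. Missing values are filled before the cast, since `astype(str)` on its own would turn them into literal strings such as "nan" that then get scored like real text.

```python
import pandas as pd

try:
    from fuzzywuzzy import fuzz
    ratio = fuzz.ratio
except ImportError:
    # Stand-in (an assumption, not fuzzywuzzy itself): difflib yields a
    # comparable 0-100 similarity score when fuzzywuzzy is not installed.
    from difflib import SequenceMatcher

    def ratio(a, b):
        return int(round(100 * SequenceMatcher(None, a, b).ratio()))

# Hypothetical sample data; the None mimics a row with a missing URL.
df_combo = pd.DataFrame({
    "worker_id": [1, 2, 3],
    "url_entrance": ["www.grandhotelseattle.com", "www.example.com", None],
    "company_name": ["Hotel Prestige Seattle", "Example", "Some Company"],
})

# Fill missing values first, then make sure every cell is a string.
for col in ["url_entrance", "company_name"]:
    df_combo[col] = df_combo[col].fillna("").astype(str)

# Score each row separately (axis=1) and store the result in a new column.
df_combo["fuzz_ratio"] = df_combo.apply(
    lambda row: ratio(row["url_entrance"], row["company_name"]), axis=1
)

print(df_combo[["url_entrance", "company_name", "fuzz_ratio"]])
```

The key difference from passing the two columns straight to `fuzz.ratio` is `axis=1`, which hands the function one row at a time, so each row gets its own score.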

Stanleyrr
  • This was helpful. In my case, I had some NaN values, so a simple `df['a'].fillna(' ', inplace=True)` before running the fuzzy match worked. – cyril Apr 26 '17 at 20:43