Context of Problem

I am working on a project where I would like to compare two columns from a dataframe to determine what percent of the strings are similar to each other. Specifically, I'm comparing whether bullets scraped from retailer websites match the bullets that I expect to see on those sites for a given product.

I know that I can simply use boolean logic to check whether column ['X'] == column ['Y'], but I'd like to go a step further and determine what percentage of X matches Y. I did some research and found that the ratio() method of difflib.SequenceMatcher can accomplish what I want.

Example of SequenceMatcher.ratio()

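from difflib import SequenceMatcher

a = 'preview'
b = 'previeu'

# ratio() returns a float in [0, 1]; 1.0 means the strings are identical
SequenceMatcher(a=a, b=b).ratio()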

My Use Case

Where I'm having trouble is applying this logic to iterate through a DataFrame. This is what my DataFrame looks like.

[Image: DataFrame]

The DataFrame, which is called test, has 5 "Bullet" columns and 5 "SeoBullet" columns. I tried using a for loop to apply a lambda function to each pair of columns.

for x in range(1,6):
    test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())

But I received the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-409-39a6ba3c8879> in <module>
      1 for x in range(1,6):
----> 2     test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())

~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
   7539             kwds=kwds,
   7540         )
-> 7541         return op.get_result()
   7542 
   7543     def applymap(self, func) -> "DataFrame":

~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in get_result(self)
    178             return self.apply_raw()
    179 
--> 180         return self.apply_standard()
    181 
    182     def apply_empty_result(self):

~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    253 
    254     def apply_standard(self):
--> 255         results, res_index = self.apply_series_generator()
    256 
    257         # wrap results

~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    282                 for i, v in enumerate(series_gen):
    283                     # ignore SettingWithCopy here in case the user mutates
--> 284                     results[i] = self.f(v)
    285                     if isinstance(results[i], ABCSeries):
    286                         # If we have a view on v, we need to make a copy because

<ipython-input-409-39a6ba3c8879> in <lambda>(row)
      1 for x in range(1,6):
----> 2     test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())

~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
    880 
    881         elif key_is_scalar:
--> 882             return self._get_value(key)
    883 
    884         if (

~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
    989 
    990         # Similar to Index.get_value, but we do not fall back to positional
--> 991         loc = self.index.get_loc(label)
    992         return self.index._get_values_for_loc(self, loc, label)
    993 

~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
    352                 except ValueError as err:
    353                     raise KeyError(key) from err
--> 354             raise KeyError(key)
    355         return super().get_loc(key, method=method, tolerance=tolerance)
    356 

KeyError: 'SeoBullet_1'
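
The last frame of the traceback is in pandas\core\indexes\range.py, which suggests the lambda received a column rather than a row: DataFrame.apply defaults to axis=0, so each column Series is passed in, and 'SeoBullet_1' is then looked up in the row labels (a RangeIndex). A minimal sketch of the row-wise version, assuming the column names from the loop above and a made-up two-row sample with only two bullet pairs:

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample data; only the column naming follows the question.
test = pd.DataFrame({
    "SeoBullet_1": ["preview", "same text"],
    "Bullet 1": ["previeu", "same text"],
    "SeoBullet_2": ["abcd", "xyz"],
    "Bullet 2": ["abcd", "abc"],
})

# The sample has two bullet pairs; the real data would use range(1, 6).
for x in range(1, 3):
    # axis=1 hands each row to the lambda, so column labels can be used as keys.
    test[f"Bullet {x} Ratio"] = test.apply(
        lambda row: SequenceMatcher(
            a=row[f"SeoBullet_{x}"], b=row[f"Bullet {x}"]
        ).ratio(),
        axis=1,
    )
```

Each "Bullet {x} Ratio" column then holds one ratio per row.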

Desired Output

Ideally, the final output would be a dataframe that has 5 additional columns with the ratios for each Bullet comparison.

I'm still new-ish to Python, so I could just be naïve and missing something very obvious. I mention this to say that if there is another route I could take to accomplish the same thing (or something very similar), I am open to suggestions.
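
One possible alternative route, sketched under assumptions: a small helper that compares two values and a list comprehension over paired values instead of apply. The helper name bullet_ratio is hypothetical, and treating a missing (non-string) bullet as a ratio of 0.0 is an assumption, since the question does not say whether scraped bullets can be missing.

```python
from difflib import SequenceMatcher


def bullet_ratio(expected, scraped):
    """Return the SequenceMatcher ratio of two strings.

    Assumption: if either value is not a string (e.g. None or NaN),
    the pair counts as no match and 0.0 is returned.
    """
    if not isinstance(expected, str) or not isinstance(scraped, str):
        return 0.0
    return SequenceMatcher(a=expected, b=scraped).ratio()


# Hypothetical paired values standing in for two DataFrame columns.
expected = ["preview", "same text", None]
scraped = ["previeu", "same text", "extra"]

ratios = [bullet_ratio(e, s) for e, s in zip(expected, scraped)]
```

With a DataFrame, the same comprehension could run over zip(test['SeoBullet_1'], test['Bullet 1']) and the resulting list could be assigned to a new column.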

Ismaili Mohamedi
  • Refrain from showing your dataframe as an image. Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Oct 17 '22 at 14:20
  • Please clarify what you mean by the statement "Ideally, the final output would be a dataframe that has 5 additional columns with the ratios for each Bullet comparison." What ratio are you expecting to show for each row? Is the ratio for each row the same, if so why duplicate it in a column? If the ratio is expected to be different for each row, how is it computed? – itprorh66 Oct 17 '22 at 14:45
