Context of Problem
I am working on a project where I would like to compare two columns from a dataframe to determine what percent of the strings are similar to each other. Specifically, I'm comparing whether bullets scraped from retailer websites match the bullets that I expect to see on those sites for a given product.
I know that I can simply use boolean logic to determine if the value from column ['X'] == column ['Y'
]. But I'd like to take it to another level and determine what percentage of X matches Y. I did some research and found that difflib.ratio()
can accomplish what I want.
Example of difflib.ratio()
a = 'preview'
b = 'previeu'
SequenceMatcher(a=a, b=b).ratio()
My Use Case
Where I'm having trouble is applying this logic to iterate through a DataFrame. This is what my DataFrame looks like.
The DataFrame has 5 "Bullets" and 5 "SEO Bullets". So I tried using a for loop to apply a lambda function to my DataFrame called test.
for x in range(1,6):
test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
But I received the following error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-409-39a6ba3c8879> in <module>
1 for x in range(1,6):
----> 2 test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwds)
7539 kwds=kwds,
7540 )
-> 7541 return op.get_result()
7542
7543 def applymap(self, func) -> "DataFrame":
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in get_result(self)
178 return self.apply_raw()
179
--> 180 return self.apply_standard()
181
182 def apply_empty_result(self):
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_standard(self)
253
254 def apply_standard(self):
--> 255 results, res_index = self.apply_series_generator()
256
257 # wrap results
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
282 for i, v in enumerate(series_gen):
283 # ignore SettingWithCopy here in case the user mutates
--> 284 results[i] = self.f(v)
285 if isinstance(results[i], ABCSeries):
286 # If we have a view on v, we need to make a copy because
<ipython-input-409-39a6ba3c8879> in <lambda>(row)
1 for x in range(1,6):
----> 2 test[f'Bullet {x} Ratio'] = test.apply(lambda row: SequenceMatcher(a=row[f'SeoBullet_{x}'], b=row[f'Bullet {x}']).ratio())
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
880
881 elif key_is_scalar:
--> 882 return self._get_value(key)
883
884 if (
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\series.py in _get_value(self, label, takeable)
989
990 # Similar to Index.get_value, but we do not fall back to positional
--> 991 loc = self.index.get_loc(label)
992 return self.index._get_values_for_loc(self, loc, label)
993
~\AppData\Local\Programs\PythonCodingPack\lib\site-packages\pandas\core\indexes\range.py in get_loc(self, key, method, tolerance)
352 except ValueError as err:
353 raise KeyError(key) from err
--> 354 raise KeyError(key)
355 return super().get_loc(key, method=method, tolerance=tolerance)
356
KeyError: 'SeoBullet_1'
Desired Output
Ideally, the final output would be a dataframe that has 5 additional columns with the ratios for each Bullet comparison.
I'm still new-ish to Python, so I could just naïve and missing something very obvious. I say this also to say that if there is another route I could go to accomplish the same thing (or something very similar) I am open to those suggestions.