Hi, and thanks in advance. I'm new to Python and pandas.
I have a DataFrame column
df['name']
that holds a large set of product names, all with different lengths, letters, numbers, punctuation and spacing. Almost every name is therefore a unique value, which makes it hard to find variants of the same product. As a first step, I split the column values on spaces:
df['name'].str.split(" ",expand = True)
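To make the effect of that split concrete, here is a minimal sketch on two names taken from the data set further down. The small DataFrame is an assumption for illustration only, not the real data.

```python
import pandas as pd

# Hypothetical sample: two names copied from the data set in this question.
df = pd.DataFrame({"name": [
    "star t-shirt-large-red",
    "beautiful rainbow skirt small",
]})

# Split on single spaces; expand=True puts each token in its own column.
parts = df["name"].str.split(" ", expand=True)
print(parts)
```

The first name contains only one space, so it yields just two tokens and the remaining columns are left empty. The hyphen-separated attributes stay glued together, which is why splitting on spaces alone does not separate the product from its size and colour.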
I found some code in the question "How can I compare two lists in python and return matches?", but I don't know how to apply it to iterate over and compare the items of a list, because it uses variables and two lists, and I have just one list. An answer there reads:
Not the most efficient one, but by far the most obvious way to do it is:
a = [1, 2, 3, 4, 5]
b = [9, 8, 7, 6, 5]
set(a) & set(b)
{5}
If order is significant, you can do it with list comprehensions like this:
[i for i, j in zip(a, b) if i == j]
[5]
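One possible way to carry this idea over to a single list of strings is to compare every pair of items and keep the longest substring they share. The sketch below is an assumption about how that could look, not something taken from the linked answer: it uses `itertools.combinations` for the pairs and `difflib.SequenceMatcher.find_longest_match` for the shared substring, and the helper name `longest_common_substring` is made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Sample names copied from the data set in this question.
names = [
    "star t-shirt-large-red",
    "star t-shirt-large-blue",
    "star t-shirt-small-red",
    "beautiful rainbow skirt small",
]

def longest_common_substring(a, b):
    """Return the longest contiguous block of characters shared by a and b."""
    # autojunk=False keeps difflib from ignoring frequent characters
    # when the strings are long.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]

# Compare each item with every other item exactly once.
pairs = {
    (a, b): longest_common_substring(a, b)
    for a, b in combinations(names, 2)
}

for (a, b), shared in pairs.items():
    print(repr(a), "|", repr(b), "->", repr(shared))
```

Comparing every pair grows quadratically with the number of names, so on a large column this would probably need grouping first (for example by the first word) before the pairwise step.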
- What I'm trying to achieve is:
Data set:
1. star t-shirt-large-red
2. star t-shirt-large-blue
3. star t-shirt-small-red
4. beautiful rainbow skirt small
5. long maxwell logan jeans- light blue -32L-28W
6. long maxwell logan jeans- Dark blue -32L-28W
- Compare all items in the list against each other and find the longest string match. Example: products 1, 2 and 3 have matching partial strings.
Result:
   COL1                              COL2          COL3    COL4
1  [star t-shirt]                    [large]       [red]   NONE
2  [star t-shirt]                    [large]       [blue]  NONE
3  [star t-shirt]                    [small]       [red]   NONE
4  [beautiful rainbow skirt small]   NONE          NONE    NONE
5  [long maxwell logan jeans]        [light blue]  [32L]   [28W]
6  [long maxwell logan jeans]        [Dark blue]   [32L]   [28W]
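To show why a single separator rule is not enough for this result, here is a small sketch that splits two of the names above on hyphens with `re.split`. The helper name `split_on_hyphens` is hypothetical, and treating the hyphen as the only separator is an assumption made just for this illustration.

```python
import re

def split_on_hyphens(name):
    """Split on each hyphen, dropping any whitespace around it."""
    return re.split(r"\s*-\s*", name)

print(split_on_hyphens("long maxwell logan jeans- light blue -32L-28W"))
# ['long maxwell logan jeans', 'light blue', '32L', '28W']

print(split_on_hyphens("star t-shirt-large-red"))
# ['star t', 'shirt', 'large', 'red']
```

The jeans row comes out in the desired shape, but the hyphen inside "t-shirt" is cut as well, so the product name itself gets broken apart. That is the part where comparing names against each other seems to be needed.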
Can anyone point me in the right direction on how to achieve my end result? I have read about modules like fuzzywuzzy and difflib but don't know how to apply them, and about regex, but I'm not sure how I would achieve string matching in a list with so many different formats. When responding, could you please explain it step by step so I can understand what you're doing and why? This is just for learning purposes. Thank you in advance again.