1

Hi and thanks in advance, new to python and pandas.

  1. I have a df column df['name'], this large data consists of product names all with different lengths,letters, numbers, punctuation and spacing. This makes each name a unique value this makes it hard to find variants of some of the products.

  2. I then split the column values by spacing.

    df['name'].str.split(" ",expand = True)

I Found some code in this Question but I dont know how to apply it to iterate and compare through a list as its using variables and 2 list and I just have one list. How can I compare two lists in python and return matches?

Not the most efficient one, but by far the most obvious way to do it is: a = [1, 2, 3, 4, 5] b = [9, 8, 7, 6, 5] set(a) & set(b) {5} if order is significant you can do it with list comprehensions like this: [i for i, j in zip(a, b) if i == j] [5]

  1. What im trying to achieve is:

data set

1.star t-shirt-large-red 2.star t-shirt-large-blue 3.star t-shirt-small-red 4.beautiful rainbow skirt small 5.long maxwell logan jeans- light blue -32L-28W 6.long maxwell logan jeans- Dark blue -32L-28W

-compare all items in the list against each other and find the longest string match. Example: products:1,2,3 have matching partial strings result COL1 COL2 COL3 COL4 1[star t-shirt] [large] [red] NONE 2[star t-shirt] [large] [blue] NONE 3[star t-shirt] [small] [red] NONE 4[beautiful rainbow skirt small] NONE NONE NONE 5[long maxwell logan jeans] [light blue] [32L] [28W] 6[long maxwell logan jeans] [Dark blue] [32L] [28W]

Can anyone point me in the right direction in how to achieve my end result. I researched about modules like fuzzywuzzy and diffilab but don't know how to apply it also regex but im not sure how I would achieve string matching in a list with so many different formats? Please when responding can you explain it step by step so I can understand what your doing and why. Just for learning purposes Thank you in advance again.

  • How do you decide whether or not to separate `[light blue]`? Although humans know this is color, but program does not know. They will be separated according to your space separator. – giser_yugang Sep 04 '18 at 07:41
  • I was think with the "-" maybe but im not sure if thats possible. Also if your comparing to the previous string would it output the matched longest string . – London Holder Sep 04 '18 at 08:31
  • Maybe split by the dashes first then do a match comparison, then do another split by white spaces and match comparsion – London Holder Sep 04 '18 at 08:54
  • @giser_yugang do you have any ideas, I think might have to do this sequentially but doing a number of splitting and sorting. – London Holder Sep 04 '18 at 23:16

1 Answers1

1

Well, your question is really big. I think you have to reconsider the purpose of doing so.

In the first step, each row corresponds to each other.

df['onkey'] = 1
df1 = pd.merge(df[['name','onkey']],df[['name','onkey']], on='onkey')
df1['list'] = df1.apply(lambda x:[x.name_x,x.name_y],axis=1)

The second step is to find the longest string match.

from os.path import commonprefix
df1['COL1'] = df1['list'].apply(lambda x:commonprefix(x))

Remove rows that not find string match.

df1['COL1_num'] = df1['COL1'].apply(lambda x:len(x))
df1 = df1[(df1['COL1_num']!=0)]

Find the shortest match.

df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]

Merge df and df1.

df = df.rename(columns ={'name':'name_x'})
df = pd.merge(df,df1[['name_x','COL1']],on='name_x',how ='left')

We can look data like this:

                                         name_x  onkey                           COL1
0                         star t-shirt-large-red      1                  star t-shirt-
1                        star t-shirt-large-blue      1                  star t-shirt-
2                         star t-shirt-small-red      1                  star t-shirt-
3                  beautiful rainbow skirt small      1  beautiful rainbow skirt small
4  long maxwell logan jeans- light blue -32L-28W      1     long maxwell logan jeans- 
5   long maxwell logan jeans- Dark blue -32L-28W      1     long maxwell logan jeans-

As you can see, we have found the longest string match.

Handle the common string, and we separate the remaining string.

df['len'] = df['COL1'].apply(lambda x: len(x))
df['other'] = df.apply(lambda x: x.name_x[x.len:],axis=1)
df['COL1'] = df['COL1'].apply(lambda x: x.strip())
df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x[-1]=='-' else x)
df['other'] = df['other'].apply(lambda x:x.split('-'))

Finally, we will concat them.

df = df[['COL1','other']]
df = pd.concat([df['COL1'],df['other'].apply(pd.Series)],axis=1)

The result:

                            COL1            0     1    2
0                   star t-shirt        large   red  NaN
1                   star t-shirt        large  blue  NaN
2                   star t-shirt        small   red  NaN
3  beautiful rainbow skirt small                NaN  NaN
4       long maxwell logan jeans  light blue    32L  28W
5       long maxwell logan jeans   Dark blue    32L  28W
giser_yugang
  • 6,058
  • 4
  • 21
  • 44
  • Thanks so much for responding and showing me step by step. But essentially this method would take a long time in regards to whole it would take to complete due it comparing every single row by one row and going down the column. – London Holder Sep 06 '18 at 03:07
  • According to your needs, I think you can only do so. – giser_yugang Sep 06 '18 at 03:44
  • ok make sense. I did try it out on my data set but seems to be pulling the first letter of the string. – London Holder Sep 06 '18 at 03:54