customer_name                                               ANDY
number_of_product_variants                                      2
number_of_channels                                              1
number_of_discount_codes                                        1
order_count                                                     1
order_name                                            #1100,#1100
discount_code                        Christmas2020, Christmas2020
channel                                      Instagram, Instagram
product_variant                    Avengers Set A, Avengers Set B

I would like to remove the repeated values from a string, but only when the string actually contains duplicates.

Expected output:

customer_name                                                ANDY
number_of_product_variants                                      2
number_of_channels                                              1
number_of_discount_codes                                        1
order_count                                                     1
order_name                                                  #1100
discount_code                                       Christmas2020
channel                                                 Instagram
product_variant                    Avengers Set A, Avengers Set B

The code I tried:

def unique_string(l):
    ulist = []
    [ulist.append(x) for x in l if x not in ulist]
    return ulist

customer_df['channel_2']=customer_df['channel']
customer_df['channel_2'].apply(unique_string)

Running the code above on only the channel column returns:

0                                   [S, e, a, r, c, h, ,]
1                    [P, a, i, d,  , A, s, :, S, o, c, l]
2                 [P, a, i, d,  , A, s, :, S, o, c, l, ,]
3                                      [U, n, k, o, w, ,]
```
Luc

2 Answers


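If the order of the values does not matter, split each string on ', ' and pass the pieces to set to drop the duplicates.

If the order matters, use dict.fromkeys instead; it keeps the first occurrence of each value in its original position:

import pandas as pd

customer_df = pd.DataFrame({"channel_2":['Instagram, Instagram',
                                         'Instagram, Instagram1, Instagram, Instagram2']})

# set: removes duplicates, but the resulting order is arbitrary
f1 = lambda x: ', '.join(set(x.split(', ')))
# dict.fromkeys: removes duplicates and preserves first-seen order
f2 = lambda x: ', '.join(dict.fromkeys(x.split(', ')))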

customer_df['channel_2_1'] = customer_df['channel_2'].apply(f1)
customer_df['channel_2_2'] = customer_df['channel_2'].apply(f2)
print (customer_df)
                                      channel_2  \
0                          Instagram, Instagram   
1  Instagram, Instagram1, Instagram, Instagram2   

                         channel_2_1                        channel_2_2  
0                          Instagram                          Instagram  
1  Instagram2, Instagram1, Instagram  Instagram, Instagram1, Instagram2  
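As a rough sketch, the order-preserving variant can be wrapped in a named function and applied to several columns at once. The helper name `dedupe_csv` is hypothetical, and the sketch assumes every cell is a plain comma-separated string (missing values would need separate handling). It also strips whitespace around each item, so `#1100,#1100` and `Instagram, Instagram` are treated the same way.

```python
import pandas as pd

def dedupe_csv(value, sep=","):
    """Drop repeated items from a delimited string, keeping first-seen order."""
    # Strip whitespace so "a,b" and "a, b" split into the same items.
    items = [item.strip() for item in value.split(sep)]
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return ", ".join(dict.fromkeys(items))

# One row built from the values shown in the question.
customer_df = pd.DataFrame({
    "order_name": ["#1100,#1100"],
    "discount_code": ["Christmas2020, Christmas2020"],
    "channel": ["Instagram, Instagram"],
    "product_variant": ["Avengers Set A, Avengers Set B"],
})

# Apply the helper column by column; strings without duplicates are unchanged.
for col in ["order_name", "discount_code", "channel", "product_variant"]:
    customer_df[col] = customer_df[col].apply(dedupe_csv)

print(customer_df.iloc[0])
```

Strings that have no repeated items, such as the product_variant value, pass through unchanged.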
jezrael
  • @Luc - It depends on the data; with `Instagram, Instagram1, Instagram, Instagram2` the two outputs differ, but only in ordering. – jezrael Nov 24 '20 at 10:14
1

It seems that your dataframe contains strings that represent lists rather than actual lists.

Example:

'[ "Instagram", "Instagram" ]' and not ["Instagram", "Instagram"]

Note the outside single quotes.

You can tell because the list comprehension iterates over the characters of the string rather than over the elements of a list.
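A minimal illustration of that difference, reusing the logic of unique_string from the question (rewritten here as a plain loop) on a string and on a real list:

```python
def unique_string(l):
    # Same logic as the question's function: keep first occurrences, in order.
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

as_string = "Instagram, Instagram"     # a single string
as_list = ["Instagram", "Instagram"]   # a list of two strings

chars = unique_string(as_string)  # iterating a string yields its characters
items = unique_string(as_list)    # iterating a list yields its elements

print(chars)  # ['I', 'n', 's', 't', 'a', 'g', 'r', 'm', ',', ' ']
print(items)  # ['Instagram']
```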

To convert a string representation of a list into an actual list, first use:

import ast
customer_df["channel"] = customer_df["channel"].apply(ast.literal_eval) 

If you want more information on ast.literal_eval, please refer to this question.

Then you can apply your function unique_string.
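Put together, the two steps might look like the sketch below. It assumes the column really holds list literals such as '["Instagram", "Instagram"]' (the sample data here is illustrative). For plain comma-separated text like 'Instagram, Instagram', ast.literal_eval raises an error, and splitting the string on the separator is the more suitable route.

```python
import ast
import pandas as pd

def unique_string(l):
    # Keep the first occurrence of each element, preserving order.
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

# Illustrative data: each cell is the *string* form of a list.
customer_df = pd.DataFrame({
    "channel": ['["Instagram", "Instagram"]',
                '["Instagram", "Instagram1", "Instagram"]'],
})

# Step 1: parse each string into a real Python list.
customer_df["channel"] = customer_df["channel"].apply(ast.literal_eval)

# Step 2: de-duplicate the list elements and store the result.
customer_df["channel_2"] = customer_df["channel"].apply(unique_string)

print(customer_df["channel_2"].tolist())
```

Note that the result of apply has to be assigned back to a column; calling it without assignment leaves the dataframe unchanged.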

robinood