Retain only the first word if the string contains duplicate words

Question

customer_name                                               ANDY
number_of_product_variants                                      2
number_of_channels                                              1
number_of_discount_codes                                        1
order_count                                                     1
order_name                                            #1100,#1100
discount_code                        Christmas2020, Christmas2020
channel                                      Instagram, Instagram
product_variant                    Avengers Set A, Avengers Set B

I would like to remove the duplicate word only if the string contains duplicates.

Expected output:

customer_name                                                ANDY
number_of_product_variants                                      2
number_of_channels                                              1
number_of_discount_codes                                        1
order_count                                                     1
order_name                                                  #1100
discount_code                                       Christmas2020
channel                                                 Instagram
product_variant                    Avengers Set A, Avengers Set B

The code I tried:

def unique_string(l):
    ulist = []
    [ulist.append(x) for x in l if x not in ulist]
    return ulist

customer_df['channel_2']=customer_df['channel']
customer_df['channel_2'].apply(unique_string)

Using the code above on only the channel column returns:

0                                   [S, e, a, r, c, h, ,]
1                    [P, a, i, d,  , A, s, :, S, o, c, l]
2                 [P, a, i, d,  , A, s, :, S, o, c, l, ,]
3                                      [U, n, k, o, w, ,]

jezrael · Accepted Answer · 2020-11-24T10:24:14.780

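You can split each value on ', ' and pass the pieces to set, if the order of the remaining values is not important.

If order is important, use dict.fromkeys instead, which keeps the first occurrence of each value in its original position: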

import pandas as pd

customer_df = pd.DataFrame({"channel_2":['Instagram, Instagram',
                                         'Instagram, Instagram1, Instagram, Instagram2']})

# f1: set drops duplicates but does not keep the original order
f1 = lambda x: ', '.join(set(x.split(', ')))
# f2: dict.fromkeys drops duplicates and keeps first-seen order
f2 = lambda x: ', '.join(dict.fromkeys(x.split(', ')))

customer_df['channel_2_1'] = customer_df['channel_2'].apply(f1)
customer_df['channel_2_2'] = customer_df['channel_2'].apply(f2)
print (customer_df)
                                      channel_2  \
0                          Instagram, Instagram   
1  Instagram, Instagram1, Instagram, Instagram2   

                         channel_2_1                        channel_2_2  
0                          Instagram                          Instagram  
1  Instagram2, Instagram1, Instagram  Instagram, Instagram1, Instagram2
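To connect this back to the data in the question, here is a minimal sketch that applies the order-preserving variant to several columns at once. The helper name `dedupe_words` and the one-row DataFrame are assumptions made for illustration. Because `order_name` in the question is written as `#1100,#1100` with no space after the comma, the sketch splits on `,` and strips whitespace rather than splitting on `', '`; it also skips empty pieces, on the assumption that some values may end with a trailing comma.

```python
import pandas as pd


def dedupe_words(value, sep=","):
    """Drop repeated separator-delimited items, keeping first occurrences in order.

    Hypothetical helper, not part of pandas.
    """
    # Strip whitespace so "#1100,#1100" and "a, a" are handled the same way,
    # and ignore empty pieces such as the one left by a trailing comma.
    items = [part.strip() for part in value.split(sep) if part.strip()]
    # dict.fromkeys keeps only the first occurrence of each item, in order.
    return ", ".join(dict.fromkeys(items))


# Assumed reconstruction of the row shown in the question.
customer_df = pd.DataFrame({
    "order_name": ["#1100,#1100"],
    "discount_code": ["Christmas2020, Christmas2020"],
    "channel": ["Instagram, Instagram"],
    "product_variant": ["Avengers Set A, Avengers Set B"],
})

text_columns = ["order_name", "discount_code", "channel", "product_variant"]
for column in text_columns:
    customer_df[column] = customer_df[column].apply(dedupe_words)

print(customer_df.iloc[0])
```

Values without duplicates, such as `Avengers Set A, Avengers Set B`, come back unchanged apart from normalised spacing after the commas.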

@Luc - It depends on the data. With `Instagram, Instagram1, Instagram, Instagram2` the two outputs differ (only in ordering) — jezrael, Nov 24 '20 at 10:14

score 1 · Answer 2 · answered Nov 24 '20 at 10:25

It seems like your dataframe contains strings representing lists and not lists.

Example:

'[ "Instagram", "Instagram" ]' and not ["Instagram", "Instagram"]

Note the outside single quotes.

You can see that because the list comprehension iterates over the characters of the string and not over the elements of a list.
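A small sketch of that behaviour, using the function from the question rewritten as a plain loop. The input `"Search,"` is an assumption inferred from the first row of the output shown in the question.

```python
def unique_string(l):
    # Same logic as in the question: keep each element the first time it appears.
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist


# Iterating over a str yields single characters, so they get de-duplicated.
print(unique_string("Search,"))
# ['S', 'e', 'a', 'r', 'c', 'h', ',']

# Iterating over a list yields its elements, which is what was intended.
print(unique_string(["Instagram", "Instagram"]))
# ['Instagram']
```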

To convert a string representation of a list into a list you should first use:

import ast
customer_df["channel"] = customer_df["channel"].apply(ast.literal_eval)

If you want more information on ast.literal_eval, please refer to this question.

Then you can apply your function unique_string.
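Put together, the approach might look like the sketch below. It assumes the cells really hold list literals such as `'["Instagram", "Instagram"]'`; the sample DataFrame is made up for illustration.

```python
import ast

import pandas as pd

# Assumed data: each cell is a string that looks like a Python list literal.
customer_df = pd.DataFrame({
    "channel": ['["Instagram", "Instagram"]',
                '["Search", "Paid Ads", "Search"]'],
})

# Parse each string into a real list.
customer_df["channel"] = customer_df["channel"].apply(ast.literal_eval)

# De-duplicate the list elements, keeping first-seen order.
customer_df["channel"] = customer_df["channel"].apply(
    lambda items: list(dict.fromkeys(items))
)

print(customer_df["channel"].tolist())
# [['Instagram'], ['Search', 'Paid Ads']]
```

If the cells are plain comma-separated text such as `Instagram, Instagram`, as the sample in the question suggests, `ast.literal_eval` raises an error instead of returning a list, and splitting on the comma is the more direct route.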

Can you explain why this is required in the problem given above? I am unable to understand why we need literal_eval. — Mahendra Singh, Nov 24 '20 at 10:33
