Updating pandas dataframe with new column

Question

I want to create a new column with all the distinct values across the rows. Each value in a row is a string(not list).

This is how dataframe looks like:

+-----------------------------+-------------------------+---------------------------------------------+
|         first               |            second       |           third                             |  
+-----------------------------+-------------------------+---------------------------------------------+
|['able', 'shovel', 'door']   |['shovel raised']        |['shovel raised', 'raised', 'door', 'shovel']|
|['grade control']            |['grade']                |['grade']                                    |
|['light telling', 'love']    |['would love', 'closed'] |['closed', 'light']                          |
+-----------------------------+-------------------------+---------------------------------------------+

This is how the dataframe should look like after creating a new column with distinct values.

df = pd.DataFrame({'first': "['able', 'shovel', 'door']" , 'second': "['shovel raised']", 'third': "['shovel raised', 'raised', 'door', 'shovel']", "Distinct_set": "['able', 'shovel', 'door', 'shovel raised', 'raised']" }, index = [0])

How can I do it?

Please provide a reproducible code to produce the data so that people can directly copy and use. Also, can you please provide more details. — Akash Dubey, Feb 14 '20 at 06:51
Could you tell a bit more about what you are trying to achieve? and code you have tried would also aid that endeavor. — Back2Basics, Feb 14 '20 at 06:51

Mohit Sharma · Accepted Answer · 2020-02-14T07:32:49.407

1

try this:

df['new_col'] = df.apply(lambda x: list(set(x['first'] + x['second']+x['third'])), axis =1)

its creating set of single char as your data in cell is string.

"['able', 'shovel', 'door']"

to correct this use below:

df['new_col'] = df.apply(lambda x: list(set(eval(x['first']) + eval(x['second'])+eval(x['third']))), axis =1)

edited Feb 14 '20 at 07:32

answered Feb 14 '20 at 06:55

Mohit Sharma

590
3
10

It is creating a set of each alphabet and punctuations whereas I need set of words in an inverted comma. – dj2560 Feb 14 '20 at 07:10
@Mohit Sharma, avoid using eval, they have very destructive outputs. Check - https://stackoverflow.com/questions/1832940/why-is-using-eval-a-bad-practice – RockStar Feb 14 '20 at 08:24

score 1 · Answer 2 · answered Feb 14 '20 at 09:03

1

How about this:

import pandas as pd
import numpy as np

df = pd.DataFrame([[['able', 'shovel', 'door'], ['shovel raised'], ['shovel raised', 'raised', 'door', 'shovel']], [['grade control'], ['grade'], ['grade']], [['light telling', 'love'], ['would love', 'closed'], ['closed', 'light']]], columns=['first', 'second', 'third'])

df.apply(lambda row: [np.unique(np.hstack(row))], raw=True, axis=1)

The last command produces:

0        [[able, door, raised, shovel, shovel raised]]
1                             [[grade, grade control]]
2    [[closed, light, light telling, love, would lo...

which can be saved in a new column of the dataframe:

df['Distinct_set'] = df.apply(lambda row: [np.unique(np.hstack(row))], raw=True, axis=1)

answered Feb 14 '20 at 09:03

mac13k

2,423
23
34

You missed conversion of the list as a string value. Read the question again. – RockStar Feb 14 '20 at 09:15
The brackets in the example from the main question are misleading, ie. are they supposed to be the part of the actual string? Once the author provides the code to generate the dataframe I will reassess. – mac13k Feb 14 '20 at 09:56
He mentioned in the question, "Each value in a row is a string(not list)" if you see the first column from expected df also you will get to know. – RockStar Feb 14 '20 at 10:02
OK, you are right. This means that the square brackets are the part of the strings, which is weird... – mac13k Feb 14 '20 at 10:23

score 0 · Answer 3 · answered Feb 14 '20 at 08:17

You can try out below snippet

import json
def get_list_from_str(s):
    return json.loads(s.replace("'", '"'))

def flatten_list_rows(row):
    return (set(
        get_list_from_str(row['first']) + 
        get_list_from_str(row['second']) + 
        get_list_from_str(row['third']) 
    ))

df['Distinct_set'] = df.apply(flatten_list_rows, axis=1)

Updating pandas dataframe with new column

3 Answers3