
I have the following problem I want to solve:

Let's say I have a DataFrame which looks like this:

match              0              1              2        3    4    5    6    7
1           Morocco         France           Morocco      NaN  NaN  NaN  NaN  NaN
2           Morocco         France           Morocco      NaN  NaN  NaN  NaN  NaN
3           Morocco         France           NaN          NaN  NaN  NaN  NaN  NaN
4           China           United States    NaN          NaN  NaN  NaN  NaN  NaN
5           China           NaN              NaN          NaN  NaN  NaN  NaN  NaN

I'm looking for a way to find the unique values in each row and add them to another column while dropping all NaNs.

The output should look like the following:

match       8
1           Morocco, France
2           Morocco, France
3           Morocco, France
4           China, United States
5           China

Any suggestions on how to solve this?
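
For reference, here is one way to reconstruct this sample DataFrame (the exact layout is an assumption: `match` is treated as a regular column and the team columns as labeled 0 through 7):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'match': [1, 2, 3, 4, 5],
    0: ['Morocco', 'Morocco', 'Morocco', 'China', 'China'],
    1: ['France', 'France', 'France', 'United States', np.nan],
    2: ['Morocco', 'Morocco', np.nan, np.nan, np.nan],
})
for col in range(3, 8):
    df[col] = np.nan  # columns 3-7 are entirely NaN in the sample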

  • What is the issue, exactly? Have you tried anything, done any research? Stack Overflow is not a free code writing service. See: [tour], [ask], [help/on-topic], https://meta.stackoverflow.com/questions/261592/how-much-research-effort-is-expected-of-stack-overflow-users. You haven't even shared the data in a format which is somewhat convenient to use. Where does that DataFrame come from in the first place? Is there no way to create it correctly/change it beforehand? – AMC Mar 10 '20 at 19:35
  • I did quite some research, but didn't find any promising results for this kind of problem, so I decided to ask this question. Of course the question could be improved, but judging by the answers given, the problem seemed to be fully understandable without including unnecessary information – Jan-Felix Klein Mar 11 '20 at 06:51
  • _didn't find any promising results for this kind of problem_ https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python, https://stackoverflow.com/questions/33158417/pandas-combine-two-strings-ignore-nan-values. – AMC Mar 11 '20 at 12:11

4 Answers

# Drop the NaN entries in each row: x.dropna()
# Convert the remaining values to str: .astype(str)
# Convert to a set to drop duplicate entries
# Join the unique values with ", "

df_new = df.apply(lambda x: ", ".join(set(x.dropna().astype(str))), axis=1)
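
To reproduce the requested output, the resulting Series can be attached as a new column (a small usage sketch, assuming the `df_new` computed above):

df[8] = df_new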


isabella
  • I tested your answer and modified it to `df_new = df.apply(lambda x: ", ".join(list(set(x.dropna().astype(str)))), axis=1)` to get the wanted results – Jan-Felix Klein Mar 11 '20 at 06:46

Here is an attempt combining set and list in a lambda:

# Collect each row's eight values into a list in a new column
df_ex[8] = df_ex[[0, 1, 2, 3, 4, 5, 6, 7]].values.tolist()
# Keep only the non-NaN values and de-duplicate with a set
df_ex[8] = df_ex[8].apply(lambda x: list(set(y for y in x if str(y) != 'nan')))

Output:

0         [Morocco, France]
1         [Morocco, France]
2         [Morocco, France]
3    [United States, China]
4                   [China]
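
If the comma-separated string from the question is preferred over a list of values, one more step converts it (a small sketch using `Series.str.join`, which joins each row's list of strings):

df_ex[8] = df_ex[8].str.join(', ')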
Celius Stingher

Use:

cols = df.columns[df.columns.str.isnumeric()]
# or select the columns explicitly:
# cols = df.columns[1:]
# cols = df.columns.difference(['match'])
df[int(cols[-1]) + 1] = df[cols].agg(lambda x: ', '.join(set(x.dropna())), axis=1)
# or, if you want the new column label as a string:
# df[f'{int(cols[-1]) + 1}'] = df[cols].stack().groupby(level=0).agg(', '.join)
df = df.reindex(columns=df.columns.difference(cols))
print(df)

                      8  match
0       France, Morocco      1
1       France, Morocco      2
2       France, Morocco      3
3  China, United States      4
4                 China      5

Alternatively, since `stack()` drops NaN values by default, we can use:

df[int(cols[-1]) + 1] = (df[cols].stack()
                                 .groupby(level=0)
                                 .agg(lambda x: ', '.join(set(x))))
ansev

A long, loop-based way to do it:

dee = dict(tuple(df.groupby('match')))
tmp = []   # match ids
tmp2 = []  # list of unique team names per match
for k, v in dee.items():
    tmp.append(k)
    tmp3 = []  # collect this group's unique values, in order of appearance
    for i in v.columns.drop('match'):
        for val in v[i].dropna():
            if val not in tmp3:
                tmp3.append(val)
    tmp2.append(tmp3)
new = pd.DataFrame({'match': tmp, 'List': tmp2})
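
On the sample frame from the question this yields, roughly:

   match                    List
0      1       [Morocco, France]
1      2       [Morocco, France]
2      3       [Morocco, France]
3      4  [China, United States]
4      5                 [China]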
Moo10000
  • I think you definitely don't want nested loops! – Celius Stingher Mar 10 '20 at 19:36
  • I hear this often, and I assume this is due to memory and time? Don't developers use nested loops often? I did a time check on this version of the code using a 27000 row by 26 column dataset. Took 1.34 seconds. Will compare to one of the other posted methods. – Moo10000 Mar 10 '20 at 19:41
  • It's not that it's evil per se, and it's a perfectly valid answer to the question. It's more that it should be avoided if there's a better solution available. Especially on a site like Stack Overflow, which is meant to be a knowledge repository more than a code-help service. Consider a case where someone runs into a similar situation next year and finds this question. Only their dataset is 27,000,000 rows and 500 columns, and there are processing time requirements for their business. Now a vectorized approach that takes 2ms vs 50ms on smaller data is looking more and more preferable. – G. Anderson Mar 10 '20 at 20:10
  • @G.Anderson I tried the answer posted by user isabella. On my machine I got a faster time with my method; is there a range where one method works better than the other? Could make for an interesting blog page: `1.17 s ± 182 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` vs `4.69 s ± 174 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)` – Moo10000 Mar 10 '20 at 20:31
  • The actual run time is going to be highly dependent on the machine and environment, but the best test is usually to create bigger and bigger (and more diverse) datasets and see whether one method scales better than another. I will say that operating on a dict as you've done is likely better than the same iteration on the dataframe with, for example, `iterrows`. For my machine, your solution runs `50.1 ms ± 12.9 ms`, and `df.iloc[:,1:].apply(lambda x: set([i for i in x if str(i) !='nan']), axis=1)` runs in `2.38 ms ± 134 µs` – G. Anderson Mar 10 '20 at 20:55
  • One other point worth mentioning: `apply(axis=1)` coupled with `[i for i in...]` _is_ in fact a nested loop, it's just optimized under the hood using built-in methods to run faster on the underlying data structures. A good resource in a stack overflow question: [Are for-loops in pandas really bad? When should I care?](https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care) – G. Anderson Mar 10 '20 at 20:57
  • @G.Anderson interesting, thank you for the insight, there sure is a lot to learn about this Python stuff – Moo10000 Mar 11 '20 at 04:00
  • Always more to learn! That's the beauty of this site. Cheers! – G. Anderson Mar 11 '20 at 14:11
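
For anyone who wants to reproduce the timing comparison from this thread, here is a minimal sketch of the kind of scaling test discussed above. The frame size, NaN density, and timing harness are illustrative assumptions, not the commenters' actual benchmark setup:

import time

import numpy as np
import pandas as pd

# Synthetic frame in the same shape family as the question:
# eight string columns with NaN gaps.
rng = np.random.default_rng(0)
n_rows = 100_000
teams = ['Morocco', 'France', 'China', 'United States']
values = rng.choice(teams, size=(n_rows, 8)).astype(object)
values[rng.random((n_rows, 8)) < 0.5] = np.nan  # punch NaN holes
df_big = pd.DataFrame(values)

# Time the row-wise apply from the comment thread above.
start = time.perf_counter()
out = df_big.apply(lambda x: set(i for i in x if str(i) != 'nan'), axis=1)
print(f'apply(axis=1) over {n_rows} rows: {time.perf_counter() - start:.2f}s')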