2

Task1

supposed dataset

    Name    B   C
0   James   a   a,b,c,d
1   James   a   NaN
2   Rudy    b   a,f
3   Karl    c   e,c

In column c, the values are in list form, and I want to split them up and add them to the row. Delete the value which C column is NaN

the ouptut what I want

    Name    B   C
0   James   a   a
1   James   a   b
2   James   a   c
3   James   a   d
4   Rudy    b   a
5   Rudy    b   f
6   Karl    c   e
7   Karl    c   c

Task2

I would like to label based on the relationship between James, Rudy, Karl and column 'C'.

Label standard ( mean intersection)

Label    column 'C' value
 0       James  
 1       Rudy   
 2       Karl   
 3       James ∩ Rudy   
 4       James ∩ Karl       
 5       Rudy ∩ Karl        
 6       James ∩ Rudy ∩ Karl

I want to label according to where each coulmn 'C' value belongs. Label logic

The final result reflects what I want

    Name    B   C   Label
0   James   a   a   3
1   James   a   b   0
2   James   a   c   4
3   James   a   d   0
4   Rudy    b   a   3
5   Rudy    b   f   1
6   Karl    c   e   2
7   Karl    c   c   4

For example, 'a' in column 'C' is labeled 3 because it is in both James and Rudy

Difficult for me. I'd appreciate if you help me.

Thank you for reading.

ybin
  • 555
  • 1
  • 3
  • 13

3 Answers3

1

For Task 1, if the data in columns C is list as you said, you can use explode.

df.explode('C').dropna()

    Name    B   C
0   James   a   a
0   James   a   b
0   James   a   c
0   James   a   d
2   Rudy    b   a
2   Rudy    b   f
3   Karl    c   e
3   Karl    c   c

For task 2, can't quite understand the logic.

Allen Qin
  • 19,507
  • 8
  • 51
  • 67
1

First part use DataFrame.explode with DataFrame.dropna and DataFrame.reset_index with drop=True for default index:

#if values are lists
df1 = df.explode('C').dropna(subset=['C']).reset_index(drop=True)
#if values are separated by , add split
#df1 = df.assign(C = df['C'].str.split(',')).explode('C').dropna(subset=['C']).reset_index(drop=True)
print (df1)
    Name  B  C
0  James  a  a
0  James  a  b
0  James  a  c
0  James  a  d
2   Rudy  b  a
2   Rudy  b  f
3   Karl  c  e
3   Karl  c  c

Then create second DataFrame by hashable sets called frozensets, so ordering of values is not important:

#https://stackoverflow.com/a/5898031
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(1, len(ss)+1)))

L = [(i, frozenset(x)) for i, x in enumerate(all_subsets(df['Name'].unique()))]
df2 = pd.DataFrame(L, columns=['Label','C'])
print (df2)
   Label                    C
0      0              (James)
1      1               (Rudy)
2      2               (Karl)
3      3        (Rudy, James)
4      4        (James, Karl)
5      5         (Rudy, Karl)
6      6  (Rudy, James, Karl)

And for second create Series with DataFrame.set_index, which is used for Series.map for add frozensets and then for add Labels:

s = df2.set_index('C')['Label']
df["Label"] = df['C'].map(df.groupby('C')['Name'].apply(frozenset)).map(s)
print (df)

    Name  B  C  Label
0  James  a  a      3
1  James  a  b      0
2  James  a  c      4
3  James  a  d      0
4   Rudy  b  a      3
5   Rudy  b  f      1
6   Karl  c  e      2
7   Karl  c  c      4
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Comments are not for extended discussion; this conversation has been [moved to chat](https://chat.stackoverflow.com/rooms/207749/discussion-on-answer-by-jezrael-divide-list-break-in-column-into-rows-and-multip). – Samuel Liew Feb 13 '20 at 11:29
1
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name':['James', 'James', 'Rudy','Karl'],
                   'B':['a','a','b','c'],
                   'C':[['a','b','c','d'], np.nan, ['a','f'], ['e','c']]})

# Task 1
df = df.explode(column='C').reset_index(drop=True)
df.dropna(inplace=True)


# Task 2
labels = {'James'                :0,
          'Rudy'                 :1,
          'Karl'                 :2,
          'James ∩ Rudy'         :3,
          'James ∩ Karl'         :4,
          'Karl ∩ Ruby'          :5,
          'James ∩ Karl ∩ Rudy'  :6}

C_to_labels = df.groupby('C')['Name'].apply(lambda x: labels[' ∩ '.join(sorted(x))])
df['Label'] = df['C'].map(C_to_labels)

Result:

    Name  B  C  Label
0  James  a  a      3
1  James  a  b      0
2  James  a  c      4
3  James  a  d      0
4   Rudy  b  a      3
5   Rudy  b  f      1
6   Karl  c  e      2
7   Karl  c  c      4
Andreas K.
  • 9,282
  • 3
  • 40
  • 45