3

I have a Python Pandas DataFrame like the following:

      1
0  a, b
1     c
2     d
3     e

Here `a, b` is a single string representing a list of user features.

How can I convert this into a binary matrix of the user features like the following:

     a    b    c    d    e
0    1    1    0    0    0
1    0    0    1    0    0
2    0    0    0    1    0
3    0    0    0    0    1

I saw a similar question, "Creating boolean matrix from one column with pandas", but in that question the column entries are single values rather than lists.

I have tried the two approaches below. Is there a way to combine them?

pd.get_dummies()

pd.get_dummies(df[1])


   a, b  c  d  e
0     1  0  0  0
1     0  1  0  0
2     0  0  1  0
3     0  0  0  1

df[1].apply(lambda x: pd.Series(x.split()))

      1
0  a, b
1     c
2     d
3     e

Also interested in different ways to create this type of binary matrix!

Any help is appreciated!

Thanks

jfive

2 Answers

7

I think you can use:

df = (df.iloc[:, 0].str.split(', ', expand=True)
        .stack()
        .reset_index(drop=True)
        .str.get_dummies())

print(df)
   a  b  c  d  e
0  1  0  0  0  0
1  0  1  0  0  0
2  0  0  1  0  0
3  0  0  0  1  0
4  0  0  0  0  1

EDITED:

print(df.iloc[:, 0].str.replace(' ', '').str.get_dummies(sep=','))
   a  b  c  d  e
0  1  1  0  0  0
1  0  0  1  0  0
2  0  0  0  1  0
3  0  0  0  0  1
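For comparison, here is a minimal self-contained sketch, assuming the question's frame has a single column named `1` (the construction of `df` below is a reconstruction, not the asker's actual data). It runs the edited one-liner and also an `explode`-based variant. The first attempt gained a fifth row because `stack()` followed by `reset_index(drop=True)` throws away the mapping back to the original rows; `explode` keeps the original index, so the pieces can be collapsed back per row.

```python
import pandas as pd

# Assumed reconstruction of the question's data: one column named 1.
df = pd.DataFrame({1: ["a, b", "c", "d", "e"]})

# Edited one-liner: strip spaces, then one-hot encode on the comma.
binary = df[1].str.replace(" ", "").str.get_dummies(sep=",")

# Variant: split into lists, give each element its own row (the original
# index is repeated), encode, then collapse back to one row per index.
exploded = df[1].str.split(", ").explode()
alt = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()

print(binary)
print(alt)
```

Both frames have one row per original row (index 0 to 3) and one column per feature.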
jezrael
  • There's no need to chain so many operations together just to make it a one-liner. – DSM Apr 07 '16 at 22:05
  • Interestingly, this works for `10,000` rows, but the IPython kernel dies from `100,000` rows upward. I will try to compute in blocks of 10,000 and concatenate them vertically. – jfive Apr 07 '16 at 23:07
  • @jezrael, I realised this actually adds an extra row, which is undesirable. Is there any way around this? – jfive Apr 08 '16 at 22:17
  • I don't understand, can you explain more? – jezrael Apr 08 '16 at 22:18
  • @jezrael, in the original matrix there are only rows 0-3, and this should be maintained in the output. I will update my question output now! – jfive Apr 08 '16 at 22:20
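The block-wise idea mentioned in the comments could look roughly like the sketch below. `dummies_in_blocks` and its parameters are hypothetical names, and the input is assumed to be a non-empty Series of comma-separated strings. Note that this only limits the size of the intermediate results; the final matrix is still dense, so it may not help if the number of distinct features is very large.

```python
import pandas as pd

def dummies_in_blocks(series, block_size=10_000, sep=","):
    """One-hot encode a delimited string Series block by block (hypothetical helper)."""
    parts = []
    for start in range(0, len(series), block_size):
        block = series.iloc[start:start + block_size]
        parts.append(block.str.replace(" ", "").str.get_dummies(sep=sep))
    # Different blocks may contain different labels. concat aligns the columns
    # and leaves NaN where a label was absent, so fill with 0 and cast to int.
    combined = pd.concat(parts).fillna(0).astype(int)
    # Sort columns so the result does not depend on block boundaries.
    return combined.sort_index(axis=1)

s = pd.Series(["a, b", "c", "d", "e"])
result = dummies_in_blocks(s, block_size=2)
print(result)
```

Because each block keeps its slice of the original index, the concatenated result has the same row labels as the input.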
-1

I wrote a general function, with support for grouping, to do this a while back:

import numpy as np
import pandas as pd

def sublist_uniques(data, sublist):
    # Collect every distinct label found in the list-valued column.
    categories = set()
    for labels in data[sublist]:
        if isinstance(labels, (list, np.ndarray)):
            categories.update(labels)
    return list(categories)

def sublists_to_dummies(f, sublist, index_key=None):
    categories = sublist_uniques(f, sublist)
    frame = pd.DataFrame(columns=categories)
    for d, row in f.iterrows():
        labels = row[sublist]
        # Skip rows whose entry is not a list or array (for example NaN).
        if not isinstance(labels, (list, np.ndarray)):
            continue
        # Indicator vector for this row.
        indicators = np.zeros(len(categories))
        for j in labels:
            indicators[categories.index(j)] = 1
        # Without index_key, use the row label; otherwise group by the key.
        key = d if index_key is None else row[index_key]
        if index_key is not None and key in frame.index:
            # Key already present: accumulate counts for the group.
            for j in labels:
                frame.loc[key, j] += 1
        else:
            frame.loc[key] = indicators
    return frame

In [15]: a
Out[15]:
   a group     labels
0  1   new     [a, d]
1  2   old  [a, g, h]
2  3   new  [i, m, a]

In [16]: sublists_to_dummies(a,'labels')
Out[16]:
   a  d  g  i  h  m
0  1  1  0  0  0  0
1  1  0  1  0  1  0
2  1  0  0  1  0  1

In [17]: sublists_to_dummies(a,'labels','group')
Out[17]:
     a  d  g  i  h  m
new  2  1  0  1  0  1
old  1  0  1  0  1  0
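A loop-free sketch of the same two results, assuming the labels column already holds Python lists as in the example above, uses `DataFrame.explode` followed by `pd.crosstab`. This is an alternative technique, not the function from this answer; the frame `a` is rebuilt here from the printed example.

```python
import pandas as pd

# Rebuilt from the example output shown above.
a = pd.DataFrame({
    "a": [1, 2, 3],
    "group": ["new", "old", "new"],
    "labels": [["a", "d"], ["a", "g", "h"], ["i", "m", "a"]],
})

# One row per (original row, label). reset_index() keeps the original row
# label in a column named "index" and makes the new index unique.
exploded = a.explode("labels").reset_index()

# Count labels per original row, and per group.
per_row = pd.crosstab(exploded["index"], exploded["labels"])
per_group = pd.crosstab(exploded["group"], exploded["labels"])

print(per_row)
print(per_group)
```

`crosstab` sorts the label columns alphabetically, so the column order is `a d g h i m` rather than the set-dependent order printed above; the counts are the same.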