3

Say I have the below dataframe:

x = pd.DataFrame({'a':['x, y', 'x, t, x, r', 'y, t'],
          'b':[1, 0, 1]})

            a  b
0        x, y  1
1  x, t, x, r  0
2        y, t  1

I would like to get to

  letter  num
0      x    1
1      y    1
2      x    0
3      t    0
4      x    0
5      r    0
6      y    1
7      t    1

I have solved the issue the following way, but I feel like i'm making it more complicated than it needs to be.

x.a = x.a.str.split(",")

empty = []
for b, a in zip(x.b, x.a):
    empty.append([b] * len(a))

t = [item for sublist in empty for item in sublist]
y = [item for sublist in x.a for item in sublist]

pd.DataFrame({'letter':t, 'num':y})

   letter num
0       1   x
1       1   y
2       0   x
3       0   t
4       0   x
5       0   r
6       1   y
7       1   t

Is there a better way to solve this problem?

Matt W.
  • 3,692
  • 2
  • 23
  • 46

3 Answers3

4

Use split for lists first by regex - ,\s+ for comma with one or more spaces, and then numpy.repeat with flatenning by numpy.concatenate and last DataFrame constructor:

a = x.a.str.split(",\s+")
b = np.repeat(x.b.values, a.str.len())
c = np.concatenate(a.values)

df = pd.DataFrame({'letter':c, 'num':b})
print (df)
  letter  num
0      x    1
1      y    1
2      x    0
3      t    0
4      x    0
5      r    0
6      y    1
7      t    1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

Let's expand each row by one plus the number of commas in column 'a'. Then overwrite column 'a' with new values.

x.loc[
    x.index.repeat(x.a.str.count(', ') + 1)
].assign(
    a=', '.join(x.a).split(', ')
).rename(columns=dict(a='letter', b='num'))

  letter  num
0      x    1
0      y    1
1      x    0
1      t    0
1      x    0
1      r    0
2      y    1
2      t    1
piRSquared
  • 285,575
  • 57
  • 475
  • 624
2

Just find a new way :-)

x.set_index('b').a.str.get_dummies(sep=',').replace(0,np.nan).stack().reset_index().drop(0,1)
Out[481]: 
   b level_1
0  1       y
1  1       x
2  0       r
3  0       t
4  0       x
5  0       x
6  1       t
7  1       y
BENY
  • 317,841
  • 20
  • 164
  • 234