5

My data in ddata.csv is as follows:

col1,col2,col3,col4
A,10,a;b;c, 20
B,30,d;a;b,40
C,50,g;h;a,60

I want to separate col3 into multiple columns, but based on their values. In other wants, I would like my final data to look like

col1, col2, name_a, name_b, name_c, name_d, name_g, name_h, col4
A,    10,   a,      b,      c,      NULL,   NULL,   NULL,   20
B,    30,   a,      b,      NULL,   d,      NULL,   NULL,   40
C,    50,   a,      NULL,   NULL,   NULL,   g,      h,      60

My code, at the moment taken reference from this answer, is incomplete:

import pandas as pd

import string
L = list(string.ascii_lowercase)

names = dict(zip(range(len(L)), ['name_' + x for x in  L]))
df = pd.read_csv('ddata.csv')
df2 = df['col3'].str.split(';', expand=True).rename(columns=names)

Column names 'a','b','c' ... are taken at random, and has no relevance to the actual data a,b,c.

Right now, my code can just split 'col3' into three columns as follows:

name_a name_b name_c
a      b      c
d      e      f
g      h      i

But, it should be like name_a, name_b, name_c, name_d, name_g, name_h a, b, c, NULL, NULL, NULL a, b, NULL, d, NULL, NULL a, NULL, NULL, NULL, g, h

and in the end, I need to just replace col3 with these multiple columns.

kingmakerking
  • 2,017
  • 2
  • 28
  • 44
  • @razdi My code just splits column 3 into 3 columns - name_a, name_b, and name_c. But, column name_a should contain only 'a' and not 'd' or 'h'. I need to segregate the columns based on their values. – kingmakerking May 13 '19 at 06:55

2 Answers2

6

Use Series.str.get_dummies:

print (df['col3'].str.get_dummies(';'))
   a  b  c  d  g  h
0  1  1  1  0  0  0
1  1  1  0  1  0  0
2  1  0  0  0  1  1

For extract column col3 from original use DataFrame.pop, create new DataFrame by multiple values by columns names in numpy, replace NaNs instead empty strings with DataFrame.where and DataFrame.add_prefix for new columns names.

pos = df.columns.get_loc('col3')

df2 = df.pop('col3').str.get_dummies(';').astype(bool)
df2 = (pd.DataFrame(df2.values * df2.columns.values[ None, :], 
                    columns=df2.columns,
                    index=df2.index)
         .where(df2)
         .add_prefix('name_'))

Last join all DataFrames filtered by positions with iloc join together by concat:

df = pd.concat([df.iloc[:, :pos], df2, df.iloc[:, pos:]], axis=1)
print (df)
  col1  col2 name_a name_b name_c name_d name_g name_h  col4
0    A    10      a      b      c    NaN    NaN    NaN    20
1    B    30      a      b    NaN      d    NaN    NaN    40
2    C    50      a    NaN    NaN    NaN      g      h    60
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    Thanks a lot. How do I now just replace col3 with these multiple columns? without disturbing rest of the columns/rows? – kingmakerking May 13 '19 at 07:23
0

@jezrael solution is excellent. I did not know str.get_dummies until now.

I come up with solution using stack, pivot_table, np.where and pd.concat

df1 = df.col3.str.split(';', expand=True).stack().reset_index(level=0)
df2 = pd.pivot_table(df1, index='level_0', columns=df1[0], aggfunc=len)

Out[1658]:
0          a    b    c    d    g    h
level_0
0        1.0  1.0  1.0  NaN  NaN  NaN
1        1.0  1.0  NaN  1.0  NaN  NaN
2        1.0  NaN  NaN  NaN  1.0  1.0

Next, populate 1.0 with column names using np.where, find index of col3 and using pd.concat to construct final df

df2[:] = np.where(df2.isna(), np.nan, df2.columns)
i = df.columns.tolist().index('col3')
pd.concat([df.iloc[:,:i], df2.add_prefix('name_'), df.iloc[:,i+1:]], axis=1)

Out[1667]:
  col1  col2 name_a name_b name_c name_d name_g name_h  col4
0    A    10      a      b      c    NaN    NaN    NaN    20
1    B    30      a      b    NaN      d    NaN    NaN    40
2    C    50      a    NaN    NaN    NaN      g      h    60
kingmakerking
  • 2,017
  • 2
  • 28
  • 44
Andy L.
  • 24,909
  • 4
  • 17
  • 29