4

My data is like this:

Name    test1     test2      Count
Emp1    X,Y        A           1
Emp2    X          A,B,C       2
Emp3    Z          C           3

I'm using the code below to split test1 cells with multiple values to individual rows. However, I am not sure how to split Test2 column.

df2 = df.test1.str.split(',').apply(pd.Series)
df2.index = df.set_index(['Name', 'count']).index
df2.stack().reset_index(['Name', 'count'])
df2

And the output is:

Name    test1   Count
Emp1    X        1
Emp1    Y        1
Emp2    X        2
Emp2    X        2
Emp2    X        2
Emp2    Z        3

I'm trying to split test1 and test2 so that I can achieve this output:

Name    test1    test2  Count
Emp1    X          A      1
Emp1    Y          A      1
Emp2    X          A      2
Emp2    X          B      2
Emp2    X          C      2
Emp2    Z          C      3

Can anybody help, please?

petezurich
  • 9,280
  • 9
  • 43
  • 57
PieSquare
  • 327
  • 1
  • 6
  • 23

3 Answers3

4

I am just fix your code , since I do not recommend the method you unnesting the dataframe , you can check the answer here, there are multiple nice way.

df2 = df.test1.str.split(',').apply(pd.Series)
df2.index = df.set_index(['Name', 'Count']).index
df2=df2.stack().reset_index(['Name', 'Count'])
df3 = df.test2.str.split(',').apply(pd.Series)
df3.index = df.set_index(['Name', 'Count']).index
df3=df3.stack().reset_index(['Name', 'Count'])

Just do merge here

df2.merge(df3,on=['Name', 'Count'],how='outer')
Out[132]: 
   Name  Count 0_x 0_y
0  Emp1      1   X   A
1  Emp1      1   Y   A
2  Emp2      2   X   A
3  Emp2      2   X   B
4  Emp2      2   X   C
5  Emp3      3   Z   C
BENY
  • 317,841
  • 20
  • 164
  • 234
  • @Mahesh I have tried to make my solution as generalized as possible. Take a look later and see if helps ;) – rafaelc Sep 16 '18 at 05:50
  • @RafaelC I tried the expand_all function you suggested, I observe that it is expanding the columns which are already expanded . . This is leading to duplication of values – PieSquare Sep 22 '18 at 07:17
3

I don't believe it is that straightforward to adapt this answer highlighted by @wen to this question, so I'll propose a solution.

You might create a function that takes a df, a column to be expanded and a separator for that column, and chain calls as many times as needed.

def expand(df, col, sep=','):
    r = df[col].str.split(sep)
    d = {c: df[c].values.repeat(r.str.len(), axis=0) for c in df.columns}
    d[col] = [i for sub in r for i in sub]
    return pd.DataFrame(d)

expand(expand(df, 'test1'), 'test2')

    Name    test1   test2   Count
0   Emp1    X       A       1
1   Emp1    Y       A       1
2   Emp2    X       A       2
3   Emp2    X       B       2
4   Emp2    X       C       2
5   Emp3    Z       C       3

Suppose you have a

df['test3'] = ['X1|X2|X3', 'X4', 'X5']

such that

>>> print(df)

    Name    test1   test2   Count   test3
0   Emp1    X,Y     A       1       X1|X2|X3
1   Emp2    X       A,B,C   2       X4
2   Emp3    Z       C       3       X5

Then,

>>> expand(df,'test3', '|')

    Name    test1   test2   Count   test3
0   Emp1    X,Y     A       1       X1
1   Emp1    X,Y     A       1       X2
2   Emp1    X,Y     A       1       X3
3   Emp2    X       A,B,C   2       X4
4   Emp3    Z       C       3       X5

If you think columns size may increase substantially, you can define a function expand_all to avoid having something like expand(expand(expand(expand(........)))))). For example:

def expand_all(df, cols, seps):
    ret = df
    for c,s in zip(cols,seps): ret = expand(ret,c,s)
    return ret

>>> expand_all(df, ['test1', 'test2', 'test3'], [',', ',', '|'])

    Name    test1   test2   Count   test3
0   Emp1    X       A       1       X1
1   Emp1    X       A       1       X2
2   Emp1    X       A       1       X3
3   Emp1    Y       A       1       X1
4   Emp1    Y       A       1       X2
5   Emp1    Y       A       1       X3
6   Emp2    X       A       2       X4
7   Emp2    X       B       2       X4
8   Emp2    X       C       2       X4
9   Emp3    Z       C       3       X5

Or however suitable ;)


Detail:

>>> expand(df, 'test1')

    Name    test1   test2   Count
0   Emp1    X       A       1
1   Emp1    Y       A       1
2   Emp2    X       A,B,C   2
3   Emp3    Z       C       3

>>> expand(df, 'test2')

    Name    test1   test2   Count
0   Emp1    X,Y     A       1
1   Emp2    X       A       2
2   Emp2    X       B       2
3   Emp2    X       C       2
4   Emp3    Z       C       3

>>> expand(expand(df, 'test2'), 'test1') 

    Name    test1   test2   Count
0   Emp1    X       A       1
1   Emp1    Y       A       1
2   Emp2    X       A       2
3   Emp2    X       B       2
4   Emp2    X       C       2
5   Emp3    Z       C       3


>>> expand(expand(df, 'test2'), 'test1').eq(expand(expand(df, 'test1'), 'test2')).all()

Name     True
test1    True
test2    True
Count    True
dtype: bool
rafaelc
  • 57,686
  • 15
  • 58
  • 82
1

Comprehension

pd.DataFrame(
    [(n, a, b, c)
     for n, A, B, c in zip(*map(df.get, df))
     for a in A.split(',') for b in B.split(',')],
    columns=df.columns
)

   Name test1 test2  Count
0  Emp1     X     A      1
1  Emp1     Y     A      1
2  Emp2     X     A      2
3  Emp2     X     B      2
4  Emp2     X     C      2
5  Emp3     Z     C      3
piRSquared
  • 285,575
  • 57
  • 475
  • 624