2

I do have a dataframe like this:

import pandas as pd

df = pd.DataFrame({"c0": list('ABC'),
                   "c1": [" ".join(list('ab')), " ".join(list('def')), " ".join(list('s'))],
                   "c2": list('DEF')})

  c0     c1 c2
0  A    a b  D
1  B  d e f  E
2  C      s  F

I want to create a pivot table that looks like this:

      c2
c0 c1   
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F

So, the entries in c1 are split and then treated as single elements used in a multiindex.

I do this as follows:

newdf = pd.DataFrame()

for indi, rowi in df.iterrows():

    # get all single elements in string
    n_elements = rowi['c1'].split()

    # only one element so we can just add the entire row
    if len(n_elements) == 1:
        newdf = newdf.append(rowi)
    # more than one element
    else:
        for eli in n_elements:
            # that allows to add new elements using loc, without it we will have identical index values
            if not newdf.empty:
                newdf = newdf.reset_index(drop=True)
                newdf.index = -1 * newdf.index - 1

            # add entire row
            newdf = newdf.append(rowi)
            # replace the entire string by the single element
            newdf.loc[indi, 'c1'] = eli

print newdf.reset_index(drop=True)

which yields

  c0 c1 c2
0  A  a  D
1  A  b  D
2  B  d  E
3  B  e  E
4  B  f  E
5  C  s  F

Then I can just call

pd.pivot_table(newdf, index=['c0', 'c1'], aggfunc=lambda x: ' '.join(set(str(v) for v in x)))

which gives me the desired output (see above).

For huge dataframes that can be quite slow, so I am wondering whether there is a more efficient way of doing this.

Cleb
  • 25,102
  • 20
  • 116
  • 151

2 Answers2

3

This is how I get the result , In R it is called unnest.

df.c1=df.c1.apply(lambda x : pd.Series(x).str.split(' '))

df.set_index(['c0', 'c2'])['c1'].apply(pd.Series).stack().reset_index().drop('level_2',1).rename(columns={0:'c1'}).set_index(['c0','c1'])


Out[208]: 
      c2
c0 c1   
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F
BENY
  • 317,841
  • 20
  • 164
  • 234
  • 1
    Works fine (upvoted), but will need a while to get through it ;). Just cosmetics: You can just use `split()`. – Cleb Aug 24 '17 at 21:45
3

Option 1

import numpy as np, pandas as pd

s = df.c1.str.split()
l = s.str.len()
newdf = df.loc[df.index.repeat(l)].assign(c1=np.concatenate(s)).set_index(['c0', 'c1'])
newdf

      c2
c0 c1   
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F

Option 2
Should be faster

import numpy as np, pandas as pd

s = np.core.defchararray.split(df.c1.values.astype(str), ' ')
l = [len(x) for x in s.tolist()]
r = np.arange(len(s)).repeat(l)
i = pd.MultiIndex.from_arrays([
    df.c0.values[r],
    np.concatenate(s)
], names=['c0', 'c1'])
newdf = pd.DataFrame({'c2': df.c2.values[r]}, i)
newdf

      c2
c0 c1   
A  a   D
   b   D
B  d   E
   e   E
   f   E
C  s   F
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Very nice (upvoted) and still readable! Just for completeness: you could add the `import numpy as np` line and assign it to `newdf`. – Cleb Aug 24 '17 at 21:50
  • Hi Pir, It take me time to reading thru it and figure out how efficient your method are! Thank you for sharing ~ – BENY Aug 24 '17 at 22:02