
I have the following dummy dataframe:

df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2':['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             NaN
2  i,j,k,l,m  ii~jj~kk~ll~mm

The real dataset has shape (500000, 90).

I need to unnest these values to rows and I'm using the new explode method for this, which works fine.

The problem is the NaN values: they cause unequal lengths after the explode, so I need to fill in the same number of delimiters as the corresponding filled values. In this case `~~~`, since row 1 of Col1 has three commas.


Expected output:

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm

Attempt 1:

df['Col2'].fillna(df['Col1'].str.count(',')*'~')

Attempt 2:

np.where(df['Col2'].isna(), df['Col1'].str.count(',')*'~', df['Col2'])
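Both attempts break on the fill value itself: multiplying an integer Series by a string is not an elementwise repeat in pandas. A working variant of the same count-and-repeat idea (a sketch, using `map` to build the padding per row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

# One '~' per ',' in Col1, built row by row, then used as the fill value
pad = df['Col1'].str.count(',').map(lambda n: '~' * n)
out = df['Col2'].fillna(pad)
```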

The following works, but I feel like there's an easier method:

characters = df['Col1'].str.replace(r'\w', '', regex=True).str.replace(',', '~', regex=False)
df['Col2'] = df['Col2'].fillna(characters)

print(df)

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm

d1 = df.assign(Col1=df['Col1'].str.split(',')).explode('Col1')[['Col1']]
d2 = df.assign(Col2=df['Col2'].str.split('~')).explode('Col2')[['Col2']]

final = pd.concat([d1,d2], axis=1)
print(final)

  Col1 Col2
0    a   aa
0    b   bb
0    c   cc
0    d   dd
1    e     
1    f     
1    g     
1    h     
2    i   ii
2    j   jj
2    k   kk
2    l   ll
2    m   mm

Question: is there an easier, more generalized method for this, or is my method fine as is?

Nick ODell
Erfan
  • `np.where(df['Col2'].isna(), ['~'*x for x in df['Col1'].str.count(',')], df['Col2'])`? – Quang Hoang Sep 03 '19 at 15:11
  • Drop the nan, then append it later on – BENY Sep 03 '19 at 15:18
  • Do you need the original Index? – ALollz Sep 03 '19 at 15:21
  • No I don't @ALollz – Erfan Sep 03 '19 at 15:25
  • I'm not sure why you are assigning and then exploding on the dataframe; why not operate on the series to simplify the indexing? – user3483203 Sep 03 '19 at 15:37
  • I am following the [`docs`](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#series-explode-to-split-list-like-values-to-rows) which shows how to unnest a `string` column. @user3483203 – Erfan Sep 03 '19 at 15:38
  • That's if you want to keep the original DataFrame. In this case, you don't (you just index the original column anyways), and it's only slowing your approach down. If you removed that, it might be more efficient than some of the answers proposed here. – user3483203 Sep 03 '19 at 15:38
  • You are right, I should only do that for the first `explode` since I have some `id` columns at the front which I need to retain, after that I can do `Series.explode`. Thanks @user3483203 – Erfan Sep 03 '19 at 15:41
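Following user3483203's suggestion, the pipeline can operate on the Series directly instead of `assign` + `DataFrame.explode` (a sketch, requiring pandas ≥ 0.25 for `Series.explode` and assuming the NaN has already been padded to `~~~` as described above):

```python
import pandas as pd

# Assumes the NaN in Col2 was already padded to '~~~' as in the question
df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', '~~~', 'ii~jj~kk~ll~mm']})

# Split and explode each Series directly; the repeated index labels line up,
# so a single concat reassembles the rows
final = pd.concat([df['Col1'].str.split(',').explode(),
                   df['Col2'].str.split('~').explode()], axis=1)
```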

4 Answers


One way is using `str.repeat` and `fillna()`, though I'm not sure how efficient it is:

df.Col2.fillna(pd.Series(['~']*len(df)).str.repeat(df.Col1.str.count(',')))

0       aa~bb~cc~dd
1               ~~~
2    ii~jj~kk~ll~mm
Name: Col2, dtype: object
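One caveat with this approach: the helper Series is built with a fresh RangeIndex, so `fillna` won't align if `df` has a non-default index. Building the padding on `df`'s own index avoids that (a sketch, with a non-default index just to illustrate):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']},
                  index=[10, 20, 30])  # non-default index to show the alignment issue

# Build the '~' padding on df's own index so fillna aligns row for row
pad = pd.Series('~', index=df.index).str.repeat(df['Col1'].str.count(','))
out = df['Col2'].fillna(pad)
```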
anky

pd.concat

delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in df}, axis=1
).stack()

    Col1 Col2
0 0    a   aa
  1    b   bb
  2    c   cc
  3    d   dd
1 0    e  NaN
  1    f  NaN
  2    g  NaN
  3    h  NaN
2 0    i   ii
  1    j   jj
  2    k   kk
  3    l   ll
  4    m   mm

This loops on columns in df. It may be wiser to loop on keys in the delims dictionary.

delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in delims}, axis=1
).stack()

Same thing, different look

delims = {'Col1': ',', 'Col2': '~'}
def f(c): return df[c].str.split(delims[c], expand=True)
pd.concat(map(f, delims), keys=delims, axis=1).stack()
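The stacked result keeps NaN where Col2 had no pieces; if the empty strings from the question's expected output are wanted instead, one more step does it (a sketch; the explicit `dropna(how='all')` keeps the result stable across pandas versions, since newer `stack()` no longer drops all-NaN rows itself):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

delims = {'Col1': ',', 'Col2': '~'}
stacked = pd.concat({k: df[k].str.split(delims[k], expand=True)
                     for k in delims}, axis=1).stack()

# Drop rows where every column is NaN, blank out the remaining NaN,
# and flatten the MultiIndex
final = stacked.dropna(how='all').fillna('').reset_index(drop=True)
```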
piRSquared
  • This looks promissing with the `delims` dict. Let me test, but this should be easily maintainable and generalizable to more columns. – Erfan Sep 03 '19 at 15:36

zip_longest can be useful here, given that you don't need the original Index. It will work regardless of which column has more splits:

from itertools import zip_longest, chain

df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m', 'x,y'],
                   'Col2':['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm', 'xx~yy~zz']})
#        Col1            Col2
#0    a,b,c,d     aa~bb~cc~dd
#1    e,f,g,h             NaN
#2  i,j,k,l,m  ii~jj~kk~ll~mm
#3        x,y        xx~yy~zz

l = [zip_longest(*x, fillvalue='') 
     for x in zip(df.Col1.str.split(',').fillna(''), 
                  df.Col2.str.split('~').fillna(''))]

pd.DataFrame(chain.from_iterable(l))

    0   1
0   a  aa
1   b  bb
2   c  cc
3   d  dd
4   e    
5   f    
6   g    
7   h    
8   i  ii
9   j  jj
10  k  kk
11  l  ll
12  m  mm
13  x  xx
14  y  yy
15     zz
ALollz
  • Just tested it and it produces the wrong output, check row 13,14,15 in your example. – Erfan Sep 03 '19 at 15:31
  • Maybe I was not clear enough, but if you check my expected output you can see it. The real dataset is a change history where the first column tracks the column names and the second one notes the change of the value. So the first value should be matched with the first value only in the first row. – Erfan Sep 03 '19 at 15:35
  • I tried your solution with some edits, I can't get to the correct output unfortunately. – Erfan Sep 03 '19 at 15:43
  • Yes, I guess it would be tricky if you have ID or other columns that you need to retain. – ALollz Sep 03 '19 at 15:45

Just split the dataframe into two:

df1=df.dropna()
df2=df.drop(df1.index)

d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()

final = pd.concat([pd.concat([d1, d2], axis=1), d3.to_frame()], sort=False)
Out[77]: 
  Col1 Col2
0    a   aa
0    b   bb
0    c   cc
0    d   dd
2    i   ii
2    j   jj
2    k   kk
2    l   ll
2    m   mm
1    e  NaN
1    f  NaN
1    g  NaN
1    h  NaN
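If the original interleaved row order matters, a stable sort on the index restores it after recombining; a sketch building on the split-then-recombine idea above (`kind='mergesort'` because the default index sort is not guaranteed to be stable across duplicate labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

df1 = df.dropna()
df2 = df.drop(df1.index)

d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()

# Recombine, then restore the original row order with a stable sort
final = pd.concat([pd.concat([d1, d2], axis=1), d3.to_frame()], sort=False)
final = final.sort_index(kind='mergesort')
```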
BENY
  • This comes closest to what I want; the problem is that I have quite a few of these columns – Erfan Sep 03 '19 at 15:33