
I have the following dummy dataframe:

df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2':['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             NaN
2  i,j,k,l,m  ii~jj~kk~ll~mm

The real dataset has shape (500000, 90).

I need to unnest these values to rows and I'm using the new explode method for this, which works fine.

The problem is the NaN values: they cause unequal lengths after the explode, so I need to fill in the same number of delimiters as the corresponding filled values. In this case `~~~`, since row 1 of Col1 has three commas.


Expected output:

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm

Attempt 1:

df['Col2'].fillna(df['Col1'].str.count(',')*'~')

Attempt 2:

np.where(df['Col2'].isna(), df['Col1'].str.count(',')*'~', df['Col2'])
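Both attempts break on the fill value itself: multiplying an integer Series by a string is not an elementwise repeat in pandas. A working variant of the same count-and-repeat idea (a sketch, using `map` to build the padding per row):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

# One '~' per ',' in Col1, built row by row, then used as the fill value
pad = df['Col1'].str.count(',').map(lambda n: '~' * n)
out = df['Col2'].fillna(pad)
```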

The following works, but I feel like there's an easier method:

characters = df['Col1'].str.replace(r'\w', '', regex=True).str.replace(',', '~', regex=False)
df['Col2'] = df['Col2'].fillna(characters)

print(df)

        Col1            Col2
0    a,b,c,d     aa~bb~cc~dd
1    e,f,g,h             ~~~
2  i,j,k,l,m  ii~jj~kk~ll~mm

d1 = df.assign(Col1=df['Col1'].str.split(',')).explode('Col1')[['Col1']]
d2 = df.assign(Col2=df['Col2'].str.split('~')).explode('Col2')[['Col2']]

final = pd.concat([d1,d2], axis=1)
print(final)

  Col1 Col2
0    a   aa
0    b   bb
0    c   cc
0    d   dd
1    e     
1    f     
1    g     
1    h     
2    i   ii
2    j   jj
2    k   kk
2    l   ll
2    m   mm

Question: is there an easier, more generalized method for this, or is my method fine as is?

Nick ODell
Erfan
  • `np.where(df['Col2'].isna(), ['~'*x for x in df['Col1'].str.count(',')], df['Col2'])`? – Quang Hoang Sep 03 '19 at 15:11
  • Drop the nan, then append it later on – BENY Sep 03 '19 at 15:18
  • Do you need the original Index? – ALollz Sep 03 '19 at 15:21
  • No I don't @ALollz – Erfan Sep 03 '19 at 15:25
  • I'm not sure why you are assigning and then exploding on the dataframe; why not operate on the series to simplify the indexing? – user3483203 Sep 03 '19 at 15:37
  • I am following the [`docs`](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#series-explode-to-split-list-like-values-to-rows) which shows how to unnest a `string` column. @user3483203 – Erfan Sep 03 '19 at 15:38
  • That's if you want to keep the original DataFrame. In this case, you don't (you just index the original column anyways), and it's only slowing your approach down. If you removed that, it might be more efficient than some of the answers proposed here. – user3483203 Sep 03 '19 at 15:38
  • You are right, I should only do that for the first `explode` since I have some `id` columns at the front which I need to retain, after that I can do `Series.explode`. Thanks @user3483203 – Erfan Sep 03 '19 at 15:41
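Following user3483203's suggestion, the pipeline can operate on the Series directly instead of `assign` + `DataFrame.explode` (a sketch, requiring pandas ≥ 0.25 for `Series.explode` and assuming the NaN has already been padded to `~~~` as described above):

```python
import pandas as pd

# Assumes the NaN in Col2 was already padded to '~~~' as in the question
df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', '~~~', 'ii~jj~kk~ll~mm']})

# Split and explode each Series directly; the repeated index labels line up,
# so a single concat reassembles the rows
final = pd.concat([df['Col1'].str.split(',').explode(),
                   df['Col2'].str.split('~').explode()], axis=1)
```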

4 Answers


One way is using `str.repeat` and `fillna()`, though I'm not sure how efficient it is:

df.Col2.fillna(pd.Series(['~']*len(df)).str.repeat(df.Col1.str.count(',')))

0       aa~bb~cc~dd
1               ~~~
2    ii~jj~kk~ll~mm
Name: Col2, dtype: object
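One caveat with this approach: the helper Series is built with a fresh RangeIndex, so `fillna` won't align if `df` has a non-default index. Building the padding on `df`'s own index avoids that (a sketch, with a non-default index just to illustrate):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']},
                  index=[10, 20, 30])  # non-default index to show the alignment issue

# Build the '~' padding on df's own index so fillna aligns row for row
pad = pd.Series('~', index=df.index).str.repeat(df['Col1'].str.count(','))
out = df['Col2'].fillna(pad)
```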
anky

pd.concat

delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in df}, axis=1
).stack()

    Col1 Col2
0 0    a   aa
  1    b   bb
  2    c   cc
  3    d   dd
1 0    e  NaN
  1    f  NaN
  2    g  NaN
  3    h  NaN
2 0    i   ii
  1    j   jj
  2    k   kk
  3    l   ll
  4    m   mm

This loops on columns in df. It may be wiser to loop on keys in the delims dictionary.

delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in delims}, axis=1
).stack()

Same thing, different look

delims = {'Col1': ',', 'Col2': '~'}
def f(c): return df[c].str.split(delims[c], expand=True)
pd.concat(map(f, delims), keys=delims, axis=1).stack()
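The stacked result keeps NaN where Col2 had no pieces; if the empty strings from the question's expected output are wanted instead, one more step does it (a sketch; the explicit `dropna(how='all')` keeps the result stable across pandas versions, since newer `stack()` no longer drops all-NaN rows itself):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

delims = {'Col1': ',', 'Col2': '~'}
stacked = pd.concat({k: df[k].str.split(delims[k], expand=True)
                     for k in delims}, axis=1).stack()

# Drop rows where every column is NaN, blank out the remaining NaN,
# and flatten the MultiIndex
final = stacked.dropna(how='all').fillna('').reset_index(drop=True)
```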
piRSquared
  • This looks promissing with the `delims` dict. Let me test, but this should be easily maintainable and generalizable to more columns. – Erfan Sep 03 '19 at 15:36

zip_longest can be useful here, given that you don't need the original Index. It will work regardless of which column has more splits:

from itertools import zip_longest, chain

df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m', 'x,y'],
                   'Col2':['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm', 'xx~yy~zz']})
#        Col1            Col2
#0    a,b,c,d     aa~bb~cc~dd
#1    e,f,g,h             NaN
#2  i,j,k,l,m  ii~jj~kk~ll~mm
#3        x,y        xx~yy~zz

l = [zip_longest(*x, fillvalue='') 
     for x in zip(df.Col1.str.split(',').fillna(''), 
                  df.Col2.str.split('~').fillna(''))]

pd.DataFrame(chain.from_iterable(l))

    0   1
0   a  aa
1   b  bb
2   c  cc
3   d  dd
4   e    
5   f    
6   g    
7   h    
8   i  ii
9   j  jj
10  k  kk
11  l  ll
12  m  mm
13  x  xx
14  y  yy
15     zz
ALollz
  • Just tested it and it produces the wrong output, check row 13,14,15 in your example. – Erfan Sep 03 '19 at 15:31
  • Maybe I was not clear enough, but if you check my expected output you can see it. The real dataset is a change history where the first column tracks the column names and the second one notes the change of the value. So the first value should be matched with the first value only in the first row. – Erfan Sep 03 '19 at 15:35
  • I tried your solution with some edits, I can't get to the correct output unfortunately. – Erfan Sep 03 '19 at 15:43
  • Yes, I guess it would be tricky if you have ID or other columns that you need to retain. – ALollz Sep 03 '19 at 15:45

Just split the dataframe into two:

df1=df.dropna()
df2=df.drop(df1.index)

d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()

final = pd.concat([pd.concat([d1, d2], axis=1), d3.to_frame()], sort=False)
Out[77]: 
  Col1 Col2
0    a   aa
0    b   bb
0    c   cc
0    d   dd
2    i   ii
2    j   jj
2    k   kk
2    l   ll
2    m   mm
1    e  NaN
1    f  NaN
1    g  NaN
1    h  NaN
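If the original interleaved row order matters, a stable sort on the index restores it after recombining; a sketch building on the split-then-recombine idea above (`kind='mergesort'` because the default index sort is not guaranteed to be stable across duplicate labels):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2': ['aa~bb~cc~dd', np.nan, 'ii~jj~kk~ll~mm']})

df1 = df.dropna()
df2 = df.drop(df1.index)

d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()

# Recombine, then restore the original row order with a stable sort
final = pd.concat([pd.concat([d1, d2], axis=1), d3.to_frame()], sort=False)
final = final.sort_index(kind='mergesort')
```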
BENY
  • This comes closest to what I want; the problem is that I have quite a few of these columns – Erfan Sep 03 '19 at 15:33