4

I have data that is driving me crazy. The source is a PDF file that I read with tabula to extract tables. The problem is that some rows of the table are multiline in the document, and this is how my output looks.

> sub_df.iloc[85:95]
1      Acronym     Meaning
86      ABC        Aaaaa Bbbbb Ccccc
87      CDE        Ccccc Ddddd Eeeee
88      NaN        Fffff Ggggg 
89      FGH        NaN
90      NaN        Hhhhh
91      IJK        Iiiii Jjjjj Kkkkk
92      LMN        Lllll Mmmmm Nnnnn
93      OPQ        Ooooo Ppppp Qqqqq
94      RST        Rrrrr Sssss Ttttt
95      UVZ        Uuuuu Vvvvv Zzzzz

What I would like to get is something like this.

> sub_df.iloc[85:95]
1      Acronym     Meaning
86      ABC        Aaaaa Bbbbb Ccccc
87      CDE        Ccccc Ddddd Eeeee
88      FGH        Fffff Ggggg Hhhhh      
91      IJK        Iiiii Jjjjj Kkkkk
92      LMN        Lllll Mmmmm Nnnnn
93      OPQ        Ooooo Ppppp Qqqqq
94      RST        Rrrrr Sssss Ttttt
95      UVZ        Uuuuu Vvvvv Zzzzz

I am struggling with combine_first like this:

sub_df.iloc[[88]].combine_first(sub_df.iloc[[87]])

but the result is not what I am expecting.

Also a solution with groupby would be appreciated.

Note: the index is not important and can be reset. I just want to join some consecutive rows whose columns are NaN and then dump the result to CSV, so I don't need them.
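For context, a minimal sketch of why the `combine_first` attempt falls short; the two-row frame below is a hypothetical reconstruction of rows 88-89, not the actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the two broken rows, keeping the 88/89 labels
df = pd.DataFrame({
    'Acronym': [np.nan, 'FGH'],
    'Meaning': ['Fffff Ggggg', np.nan],
}, index=[88, 89])

# combine_first aligns on index labels, so rows 88 and 89 never touch each other:
misaligned = df.loc[[88]].combine_first(df.loc[[89]])  # still two rows, both unchanged

# only when both slices share a label does combine_first patch the NaNs:
aligned = (df.loc[[88]].reset_index(drop=True)
           .combine_first(df.loc[[89]].reset_index(drop=True)))
```

Even when aligned, `combine_first` only patches NaNs with values from the other frame; it never concatenates two non-NaN strings, which is why a grouping-based approach is needed here.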

sparaflAsh

3 Answers

2

Let's try this:

df = df.assign(Meaning = df['Meaning'].ffill())
mask = ~((df.Meaning.duplicated(keep='last')) & df.Acronym.isnull())

df = df[mask]

df = df.assign(Acronym = df['Acronym'].ffill())

df_out = df.groupby('Acronym').apply(lambda x: ' '.join(x['Meaning'].str.split(r'\s').sum())).reset_index()

Output:

  Acronym                  0
0     ABC  Aaaaa Bbbbb Ccccc
1     CDE  Ccccc Ddddd Eeeee
2     FGH  Fffff Ggggg Hhhhh
3     IJK  Iiiii Jjjjj Kkkkk
4     LMN  Lllll Mmmmm Nnnnn
5     OPQ  Ooooo Ppppp Qqqqq
6     RST  Rrrrr Sssss Ttttt
7     UVZ  Uuuuu Vvvvv Zzzzz
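As a self-contained illustration, here is a hedged sketch of these four steps on a reconstructed six-row sample; the frame and its 86-91 index are assumptions, not the asker's actual data:

```python
import numpy as np
import pandas as pd

# Hypothetical six-row reconstruction of the extract (rows 88-90 hold the split record)
df = pd.DataFrame({
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan, 'IJK'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee',
                'Fffff Ggggg', np.nan, 'Hhhhh', 'Iiiii Jjjjj Kkkkk'],
}, index=range(86, 92))

# 1. carry Meaning forward so the Acronym-only row (89) borrows the line above it
df['Meaning'] = df['Meaning'].ffill()
# 2. drop the now-redundant row whose Meaning reappears below it and whose Acronym is NaN (88)
df = df[~(df['Meaning'].duplicated(keep='last') & df['Acronym'].isnull())].copy()
# 3. carry Acronym forward onto the trailing continuation row (90)
df['Acronym'] = df['Acronym'].ffill()
# 4. stitch each acronym's fragments back into one string
out = df.groupby('Acronym')['Meaning'].apply(' '.join).reset_index()
```

Grouping on the `'Meaning'` column directly keeps the output column named `Meaning` instead of `0`.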
Scott Boston
2

This is a pretty tricky question: neither ffill nor bfill alone will work here.

s1=(~(df.Acronym.isnull()|df.Meaning.isnull())) # mark the complete rows
s=s1.astype(int).diff().ne(0).cumsum() # assign a single group id to each run of bad lines
bad=df[~s1] # we only change the bad rows
good=df[s1] # keep the good rows unchanged


bad=bad.groupby(s.loc[bad.index]).agg({'1':'first','Acronym':'first','Meaning':lambda x : ''.join(x[x.notnull()])})


pd.concat([good,bad]).sort_index()
Out[107]: 
    1 Acronym            Meaning
0  86     ABC  Aaaaa Bbbbb Ccccc
1  87     CDE  Ccccc Ddddd Eeeee
2  88     FGH  Fffff Ggggg Hhhhh
5  91     IJK  Iiiii Jjjjj Kkkkk
6  92     LMN  Lllll Mmmmm Nnnnn
7  93     OPQ  Ooooo Ppppp Qqqqq
8  94     RST  Rrrrr Sssss Ttttt
9  95     UVZ  Uuuuu Vvvvv Zzzzz
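A hedged, self-contained sketch of this run-grouping idea, assuming a small reconstructed frame with a default RangeIndex (and a spaced join, no `'1'` column):

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction with a default RangeIndex (rows 2-4 hold the split record)
df = pd.DataFrame({
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan, 'IJK'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee',
                'Fffff Ggggg', np.nan, 'Hhhhh', 'Iiiii Jjjjj Kkkkk'],
})

s1 = ~(df['Acronym'].isnull() | df['Meaning'].isnull())  # True on complete rows
s = s1.astype(int).diff().ne(0).cumsum()                 # one id per run of good/bad rows
good, bad = df[s1], df[~s1]

# collapse each run of bad rows into one record ('first' skips the NaN acronyms)
bad = bad.groupby(s.loc[bad.index]).agg(
    {'Acronym': 'first', 'Meaning': lambda x: ' '.join(x[x.notnull()])})

result = pd.concat([good, bad]).sort_index()
```

Note that after the groupby the collapsed row is indexed by its run id, so sorting by index only slots it back into place when the original index is a small RangeIndex, as assumed here.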
BENY
    That's some mastery of pandas! I edited it a bit because '1' is not a column but the index in my frame, so it would raise a KeyError if copy-pasted, and it should be `' '.join`, otherwise GggggHhhhh would come out unspaced. `bad = bad.groupby(good_index.loc[bad.index]).agg({sub_df.columns[0]: 'first', sub_df.columns[1]: lambda x: ' '.join(x[x.notnull()])}); sub_df = pd.concat([good, bad]).reset_index(drop=True).sort_values(by=sub_df.columns[0])` – sparaflAsh Dec 19 '18 at 16:10
2

Here is an approach using numpy.where to do a conditional fill:

df['Acronym'] = np.where(df[['Acronym']].assign(Meaning=df.Meaning.shift()).isna().all(1),
                         df.Acronym.ffill(),
                         df.Acronym.bfill())

clean_meaning = df.dropna().groupby('Acronym')['Meaning'].apply(lambda x : ' '.join(x)).to_frame()

df_new = (df[['1', 'Acronym']]
          .drop_duplicates(subset=['Acronym'])
          .merge(clean_meaning,
                 left_on='Acronym',
                 right_index=True))

[out]

    1 Acronym            Meaning
0  86     ABC  Aaaaa Bbbbb Ccccc
1  87     CDE  Ccccc Ddddd Eeeee
2  88     FGH  Fffff Ggggg Hhhhh
5  91     IJK  Iiiii Jjjjj Kkkkk
6  92     LMN  Lllll Mmmmm Nnnnn
7  93     OPQ  Ooooo Ppppp Qqqqq
8  94     RST  Rrrrr Sssss Ttttt
9  95     UVZ  Uuuuu Vvvvv Zzzzz
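A hedged sketch of the same conditional-fill idea on a reconstructed sample; the frame is an assumption, the where-condition is written inline rather than via `assign`, and the `'1'` column is omitted:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a default RangeIndex (rows 2-4 hold the split record)
df = pd.DataFrame({
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan, 'IJK'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee',
                'Fffff Ggggg', np.nan, 'Hhhhh', 'Iiiii Jjjjj Kkkkk'],
})

# a trailing continuation row has a NaN Acronym AND a NaN Meaning just above it:
# fill those forward; every other NaN Acronym belongs to the row below, fill backward
df['Acronym'] = np.where(df['Acronym'].isna() & df['Meaning'].shift().isna(),
                         df['Acronym'].ffill(),
                         df['Acronym'].bfill())

# join the fragments per acronym, then attach them back to one row each
clean_meaning = df.dropna().groupby('Acronym')['Meaning'].apply(' '.join).to_frame()

df_new = (df[['Acronym']].drop_duplicates()
          .merge(clean_meaning, left_on='Acronym', right_index=True))
```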
Chris Adams