3

I have tried looking at other merge rows in pandas solutions here and here and especially the solution here.

I want to combine the individual sentences scraped from bullet points into one paragraph between the empty blank rows. BUT keep the blank rows as they are. I want to keep the first sentence's paragraph id as the new id. (Paragraph ids are not necessarily continuous as there was some pre-cleaning done.)

df = pd.DataFrame(data = [[1, "A Heading"],
                          [2, "I need to be with above."],
                          [3, ""],
                          [8, "I stand alone."],
                          [9, ""]],columns=['para_id','text'])

df   
# The data
#    para_id                      text
# 0        1                 A Heading
# 1        2  I need to be with above.
# 2        3                          
# 3        8            I stand alone.
# 4        9                         

My required output is:

#    para_id                                    text
# 0        1  A Heading. I need to be with above
# 1        3  
# 2        8  I stand alone.
# 3        9   

With the help of the answers I am close and just need a little more guidance.

Attempted solution

df['t'] =  df['text'].str.len().values
s = df['t'].eq(0).cumsum()
out = df.groupby(s).agg({'para_id': 'first',
                         'text': lambda x: '. '.join(x),
                         't': 'last'})
out.drop('t', inplace=True, axis=1)
   
out
# Incorrect output
#    para_id                                 text
# t                                              
# 0        1  A Heading. I need to be with above.
# 1        3                     . I stand alone.
# 2        9

I almost have it working but my blank lines are getting glued to some text. And so I am missing my first blank row.

  1. Please can someone help me formulate s better to get the desired output.

  2. I also need the joining . to only occur if there is not a fullstop at the end of last sentence. (This is not critical. I guess I could first search for missing fullstops at end of non-empty text sentences, and then join sentences but I was wondering if there was a mutate if kind of structure in pandas.)

micstr
  • 5,080
  • 8
  • 48
  • 76

1 Answers1

2

You're almost there, just groupby on both the non-zero lengths and cumsum:

s = df['text'].eq('')

(df.groupby([s.cumsum(),s], sort=False)
   .agg({'para_id':'first', 'text': '. '.join})
   .reset_index(drop=True)
)

Output:

   para_id                                 text
0        1  A Heading. I need to be with above.
1        3                                     
2        8                       I stand alone.
3        9                                     
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Thanks Quang for a **speedy** answer. Can you just explain the `cumsum` a bit more - is that to get the row groupings and then straight `s` cuts out the blank rows? Plus any pointers on handling the missing full stops? – micstr Nov 25 '21 at 15:47
  • 1
    @micstr 1) `cumsum` gathers the rows into groups that starts with a blank row. Print `s.cumsum()` and you can see the grouping. 2) yes, adding `s` separates the blank rows from other row in `s.cumsum()` groups. 3) Regarding the fullstop, maybe add another condition `s1 = df['text'].str.endswith('\.').cumsum()` into groupby. – Quang Hoang Nov 25 '21 at 15:52