How to explode a list of strings while keeping the other columns?

Question

Consider this simple example:

import pandas as pd

df = pd.DataFrame({'col1' : [1,2,3],
                   'col2' : ['A','B','C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

df
Out[11]: 
   col1 col2                            paragraph
0     1    A           sentence one. sentence two
1     2    B    sentence three. and sentence four
2     3    C  crazy sentence!! and the final one.

I would like to split the paragraphs into sentences (preferably using spaCy), but I need to keep the information in the other columns.

I know how to split (naively) on '.' and explode the column:

df.paragraph.str.split('.').explode()
Out[10]: 
0                          sentence one
0                          sentence two
1                        sentence three
1                     and sentence four
2    crazy sentence!! and the final one
2                                      
Name: paragraph, dtype: object

But this loses the information in col1 and col2 (those should be kept and repeated in the sentence-by-sentence dataframe), and it does not correctly split the sentence with the exclamation marks.

Using spaCy and nlp(paragraph).sents still loses the two columns.

What can I do? Thanks!

mozway · Answer 1 · 2021-08-18T14:39:28.603

Do it in two steps:

df['paragraph'] = df['paragraph'].str.split('.')
df.explode('paragraph')

output:

   col1 col2                           paragraph
0     1    A                        sentence one
0     1    A                        sentence two
1     2    B                      sentence three
1     2    B                   and sentence four
2     3    C  crazy sentence!! and the final one
2     3    C

To split on both . and !:

df['paragraph'] = df['paragraph'].str.split('[.!]+')
df.explode('paragraph')

   col1 col2           paragraph
0     1    A        sentence one
0     1    A        sentence two
1     2    B      sentence three
1     2    B   and sentence four
2     3    C      crazy sentence
2     3    C   and the final one
2     3    C
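
The empty last row comes from the trailing period, and splitting this way also drops the punctuation. A minimal sketch of one way around both, assuming a lookbehind pattern is an acceptable rough sentence boundary for this data (the pattern and the helper name split_sentences are assumptions of this sketch, not part of the answer above):

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

# Split on whitespace that directly follows '.' or '!'. The lookbehind
# consumes only the whitespace, so the punctuation stays with its sentence.
pattern = re.compile(r'(?<=[.!])\s+')

def split_sentences(text):
    # Hypothetical helper: strip each piece and drop empty fragments.
    return [part.strip() for part in pattern.split(text) if part.strip()]

# assign() returns a new frame, so the original df is left unchanged.
out = (df.assign(paragraph=df['paragraph'].apply(split_sentences))
         .explode('paragraph')
         .reset_index(drop=True))
print(out)
```

This is still a naive rule: abbreviations such as "e.g. this" would be split too, which is where a real sentence splitter helps.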

thanks. This works but does not split the sentence with the !. I think for this `spacy` is needed — ℕʘʘḆḽḘ, Aug 18 '21 at 14:37

score 1 · Answer 2 · answered Aug 18 '21 at 14:57

Build the exploded column as a separate temporary dataframe, then join it back to the main dataframe.

import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3],
               'col2' : ['A','B','C'],
               'paragraph': ['sentence one. sentence two',
                             'sentence three. and sentence four',
                             'crazy sentence!! and the final one.']})
alfa = pd.DataFrame(df.paragraph.str.split('.').explode())
alfa.rename(columns={"paragraph":"paragraphSplit"},inplace=True)
alfa.join(df)

paragraphSplit                      col1  col2  paragraph
sentence one                        1     A     sentence one. sentence two
sentence two                        1     A     sentence one. sentence two
sentence three                      2     B     sentence three. and sentence four
and sentence four                   2     B     sentence three. and sentence four
crazy sentence!! and the final one  3     C     crazy sentence!! and the final one.
                                    3     C     crazy sentence!! and the final one.

Hope this is what you are looking for.
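
The join works because explode() repeats the original index label for every piece, and join() aligns on the index. A small sketch of that mechanism, using made-up toy data (the frame contents and the column name sentence are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'paragraph': ['a. b', 'c', 'd. e. f']})

# Split on a period plus optional whitespace, then explode.
# Each piece keeps the index label of the row it came from.
pieces = (df['paragraph'].str.split(r'\.\s*')
                         .explode()
                         .rename('sentence'))
print(pieces.index.tolist())

# join() matches rows on the index, so every piece picks up
# the other columns of its source row.
joined = pieces.to_frame().join(df)
print(joined)
```

Note that this relies on df having a unique index; with duplicate labels in df the join would match each piece against several rows.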

score 1 · Accepted Answer · answered Aug 18 '21 at 14:58

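If you prefer spaCy to split text into sentences, use:

from spacy.lang.en import English

# Blank English pipeline with only the rule-based sentence splitter (spaCy v3 syntax)
nlp = English()
nlp.add_pipe('sentencizer')

def split_in_sentences(text):
    # Return the stripped text of every sentence spaCy detects
    return [sent.text.strip() for sent in nlp(text).sents]

# Replace each paragraph with its list of sentences, then one row per sentence
df['paragraph'] = df['paragraph'].apply(split_in_sentences)
df.explode('paragraph')

output:
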
   col1 col2           paragraph
0     1    A       sentence one.
0     1    A        sentence two
1     2    B     sentence three.
1     2    B   and sentence four
2     3    C    crazy sentence!!
2     3    C  and the final one.

See the Stack Overflow post "How to break up document by sentences with Spacy".
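
If the original dataframe should stay untouched, the same pattern can be wrapped in a helper that accepts any function returning a list of sentences. This is only a sketch: explode_sentences, naive_splitter and the sentence_no column are hypothetical names, and the stand-in splitter is there so the example runs without spaCy; a function like split_in_sentences could be passed instead.

```python
import pandas as pd

def explode_sentences(frame, column, splitter):
    """Return a new frame with one row per sentence; `frame` is not modified."""
    out = frame.assign(**{column: frame[column].apply(splitter)}).explode(column)
    # Position of each sentence within its source paragraph (0-based),
    # computed per original index label before the index is reset.
    out['sentence_no'] = out.groupby(level=0).cumcount().to_numpy()
    return out.reset_index(drop=True)

def naive_splitter(text):
    # Stand-in for a real sentence splitter: split on '|' and drop blanks.
    return [s.strip() for s in text.split('|') if s.strip()]

demo = pd.DataFrame({'col1': [1, 2], 'paragraph': ['a | b | c', 'd']})
result = explode_sentences(demo, 'paragraph', naive_splitter)
print(result)
```

Keeping the sentence position makes it possible to reassemble or order the sentences later, which the plain explode() output does not record once the index is reset.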
