Consider this simple example:

import pandas as pd

df = pd.DataFrame({'col1' : [1,2,3],
                   'col2' : ['A','B','C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

df
Out[11]: 
   col1 col2                            paragraph
0     1    A           sentence one. sentence two
1     2    B    sentence three. and sentence four
2     3    C  crazy sentence!! and the final one.

I would like to split the paragraphs into sentences (preferably using spaCy), but I need to keep the information in the other columns.

I know how to split (naively) on . and explode the column:

df.paragraph.str.split('.').explode()
Out[10]: 
0                          sentence one
0                          sentence two
1                        sentence three
1                     and sentence four
2    crazy sentence!! and the final one
2                                      
Name: paragraph, dtype: object

but this loses the information in col1 and col2 (which should be kept and repeated in the sentence-by-sentence DataFrame) and does not correctly split the sentence that ends with exclamation marks.

Using spaCy's nlp(paragraph).sents still loses the two columns.

What can I do? Thanks!

ℕʘʘḆḽḘ

3 Answers

Do it in two steps:

df['paragraph'] = df['paragraph'].str.split('.')
df.explode('paragraph')

output:

   col1 col2                           paragraph
0     1    A                        sentence one
0     1    A                        sentence two
1     2    B                      sentence three
1     2    B                   and sentence four
2     3    C  crazy sentence!! and the final one
2     3    C                                    

To split on both . and !:

df['paragraph'] = df['paragraph'].str.split('[.!]+')
df.explode('paragraph')
   col1 col2           paragraph
0     1    A        sentence one
0     1    A        sentence two
1     2    B      sentence three
1     2    B   and sentence four
2     3    C      crazy sentence
2     3    C   and the final one
2     3    C                    
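
The empty last row comes from the period that ends the final paragraph, and the split pieces keep their leading spaces. A minimal sketch of one way to tidy this up, assuming the same df as in the question (the name out is only illustrative, and ignore_index assumes pandas 1.1 or later):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

# Split on runs of . or ! without overwriting the original column in df,
# then give every piece its own row with a fresh 0..n-1 index.
out = (
    df.assign(paragraph=df['paragraph'].str.split(r'[.!]+'))
      .explode('paragraph', ignore_index=True)
)

# Remove surrounding whitespace and drop the empty pieces left by
# trailing punctuation.
out['paragraph'] = out['paragraph'].str.strip()
out = out[out['paragraph'] != ''].reset_index(drop=True)

print(out)
```

Because assign returns a new DataFrame, df itself still holds the full paragraphs after this runs.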
mozway

Build the exploded column as a separate temporary DataFrame and join it back to the main DataFrame on the index.

import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3],
               'col2' : ['A','B','C'],
               'paragraph': ['sentence one. sentence two',
                             'sentence three. and sentence four',
                             'crazy sentence!! and the final one.']})
alfa = pd.DataFrame(df.paragraph.str.split('.').explode())
alfa.rename(columns={"paragraph":"paragraphSplit"},inplace=True)
alfa.join(df)
                       paragraphSplit  col1 col2                            paragraph
0                        sentence one     1    A           sentence one. sentence two
0                        sentence two     1    A           sentence one. sentence two
1                      sentence three     2    B    sentence three. and sentence four
1                   and sentence four     2    B    sentence three. and sentence four
2  crazy sentence!! and the final one     3    C  crazy sentence!! and the final one.
2                                         3    C  crazy sentence!! and the final one.

Hope this is what you are looking for.
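
A possible variation on the join above, offered as a sketch rather than as the answerer's method: name the exploded Series, drop the empty pieces, and join it onto the frame without the original paragraph column, so the full text is not repeated in every row. The names sentences and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

# One sentence per row; the index still points at the source row in df.
sentences = (
    df['paragraph'].str.split('.')
      .explode()
      .str.strip()
      .rename('sentence')
)
# Drop the empty piece produced by the trailing period.
sentences = sentences[sentences != '']

# Joining on the index repeats col1 and col2 for every sentence.
result = df.drop(columns='paragraph').join(sentences)
print(result)
```

The join relies on explode keeping the original index labels, which is why the index is not reset before joining.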

Rajiv2806

If you prefer spaCy for splitting the text into sentences, use:

import spacy
from spacy.lang.en import English
nlp = English()
nlp.add_pipe('sentencizer')

def split_in_sentences(text):
    return [sent.text.strip() for sent in nlp(text).sents]

df['paragraph'] = df['paragraph'].apply(split_in_sentences)
df.explode('paragraph')
   col1 col2           paragraph
0     1    A       sentence one.
0     1    A        sentence two
1     2    B     sentence three.
1     2    B   and sentence four
2     3    C    crazy sentence!!
2     3    C  and the final one.

See also the Stack Overflow post "How to break up document by sentences with Spacy".
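
The apply-then-explode pattern works with any function that returns a list of sentences. A minimal sketch of a reusable helper follows; explode_sentences and regex_splitter are hypothetical names, and the regex splitter is only a stand-in so the example runs without spaCy installed. It is not a replacement for a real sentence segmenter.

```python
import re
import pandas as pd

def explode_sentences(frame, column, splitter):
    """Return a new frame with one row per item that `splitter` returns."""
    return (
        frame.assign(**{column: frame[column].apply(splitter)})
             .explode(column, ignore_index=True)
    )

def regex_splitter(text):
    # Stand-in splitter: break at whitespace that follows ., ! or ?
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})

sentences_df = explode_sentences(df, 'paragraph', regex_splitter)
print(sentences_df)
```

With spaCy available, the split_in_sentences function defined above could be passed in place of regex_splitter. The helper leaves df unchanged, so the full paragraphs remain available.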

Wiktor Stribiżew