Consider this simple example:
import pandas as pd
df = pd.DataFrame({'col1' : [1,2,3],
'col2' : ['A','B','C'],
'paragraph': ['sentence one. sentence two',
'sentence three. and sentence four',
'crazy sentence!! and the final one.']})
df
Out[11]:
col1 col2 paragraph
0 1 A sentence one. sentence two
1 2 B sentence three. and sentence four
2 3 C crazy sentence!! and the final one.
I would like to split the paragraphs into sentences (preferably using
spaCy), but I need to keep the information in the other columns.
I know how to split the column (naively) on '.' and explode it:
df.paragraph.str.split('.').explode()
Out[10]:
0 sentence one
0 sentence two
1 sentence three
1 and sentence four
2 crazy sentence!! and the final one
2
Name: paragraph, dtype: object
but this loses the information in col1 and col2 (those should be kept and repeated in the sentence-by-sentence dataframe) and does not correctly split the sentence with an exclamation mark.
Using spaCy and nlp(paragraph).sents will still lose the two columns.
What can I do? Thanks!
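For reference, here is a minimal sketch of the result I am after. It is one possible approach, not necessarily the best one. It relies on DataFrame.explode (rather than Series.explode), which repeats the other columns for every list element. The helper split_sentences is a hypothetical name and uses a simple regex as a stand-in for a real sentence splitter. With spaCy, its body could presumably be replaced by something like [s.text for s in nlp(text).sents], assuming a loaded nlp pipeline that sets sentence boundaries.

```python
import re
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['A', 'B', 'C'],
                   'paragraph': ['sentence one. sentence two',
                                 'sentence three. and sentence four',
                                 'crazy sentence!! and the final one.']})


def split_sentences(text):
    # Hypothetical helper (not part of pandas or spaCy): split on whitespace
    # that follows '.', '!' or '?'. This is a regex stand-in; with spaCy it
    # could be swapped for: [s.text for s in nlp(text).sents]
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]


# Build a column of lists, then explode the whole DataFrame so that
# col1 and col2 are repeated for each sentence.
sentences = (df.assign(sentence=df['paragraph'].apply(split_sentences))
               .explode('sentence')
               .drop(columns='paragraph')
               .reset_index(drop=True))
```

The key point is that the list of sentences is stored as a column of the dataframe before exploding, so col1 and col2 travel along with each sentence. The regex is only illustrative and will mis-split abbreviations such as "e.g.".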