I can't work out how to split the text in one data frame column into sentences and create a row for each sentence, with the other column values repeated unchanged. For example:

Initially

A | B     | C | D
-------------
x | A.B   | x | x
y | C.D.E | y | y

What I would like to have (after splitting the text in column B):

A | B | C | D
-------------
x | A | x | x
x | B | x | x
y | C | y | y
y | D | y | y
y | E | y | y

What I have done so far?

I've managed to split the text into sentences using the split() method. Now I'm stuck on the second part: creating one row per sentence.

Help would be highly appreciated.

Trenton McKinney
Qasim Khan

1 Answer

Use str.split('.') followed by explode():

str.split('.') turns each string into a list of its parts. explode() needs list-like values in the column it expands, so the split has to come first.

df['B'] = df['B'].str.split('.')
df
Out[10]: 
   A          B  C  D
0  x     [A, B]  x  x
1  y  [C, D, E]  y  y

Then explode the list, passing the column as a parameter, indicating that you want to explode the dataframe according to that column:

df = df.explode('B')
df
Out[11]: 
   A  B  C  D
0  x  A  x  x
0  x  B  x  x
1  y  C  y  y
1  y  D  y  y
1  y  E  y  y
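
For reference, the two steps above can be combined into one self-contained sketch. The sample frame below is rebuilt from the question's table, and the `ignore_index=True` argument is an addition not used in the answer: it assumes pandas 1.1 or later and renumbers the rows instead of repeating the original index labels (0, 0, 1, 1, 1).

```python
import pandas as pd

# Sample frame mirroring the table in the question (values are illustrative).
df = pd.DataFrame({
    "A": ["x", "y"],
    "B": ["A.B", "C.D.E"],
    "C": ["x", "y"],
    "D": ["x", "y"],
})

# Split each string in B on a literal '.', then emit one row per list element.
# assign() leaves the original df untouched, so the snippet can be re-run safely.
# ignore_index=True (assumes pandas >= 1.1) gives a fresh 0..n-1 index.
result = df.assign(B=df["B"].str.split(".")).explode("B", ignore_index=True)

print(result)
```

If the original index labels are needed to trace each row back to its source, drop `ignore_index=True` and keep the repeated labels as in the output above.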
David Erickson