How to split a python dataframe based on new line characters?

Question

I have a pandas dataframe in which a column contains paragraphs of text. I want to expand the dataframe into separate columns by splitting each paragraph at its newline characters. A paragraph may contain multiple newlines.

Example dataframe:

Current output:
A
foo bar
foo bar\nfoo bar
foo bar
foo bar

Desired output:

   A        B
0  foo bar
1  foo bar  foo bar
2  foo bar
3  foo bar

I have tried using this:

df.A.str.split(expand=True)

But it splits at every whitespace character, not at "\n" as I expected.

score 2 · Answer 1 · answered Jul 25 '21 at 17:32

As stated in the docs, you can specify the delimiter to split on through the optional pat parameter of the split method; if it is omitted, the method splits on whitespace:

"String or regular expression to split on. If not specified, split on whitespace."

Therefore, you can do the following to split on newlines:

df.A.str.split(pat="\n", expand=True)
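
To make the difference concrete, here is a minimal sketch comparing the default whitespace split with an explicit newline pattern. The sample frame is an assumption reconstructed from the question, and the names by_whitespace and by_newline are only illustrative.

```python
import pandas as pd

# Sample data assumed from the question: one cell contains a real newline.
df = pd.DataFrame({"A": ["foo bar", "foo bar\nfoo bar", "foo bar", "foo bar"]})

# No pattern given: splits on any whitespace (spaces and newlines alike).
by_whitespace = df["A"].str.split(expand=True)

# Explicit pattern: splits only at the newline character.
by_newline = df["A"].str.split(pat="\n", expand=True)

print(by_whitespace.shape)  # (4, 4)
print(by_newline.shape)     # (4, 2)
```

The whitespace split yields four columns because the longest cell contains four words, while the newline split yields two columns because no cell has more than two lines.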

Arne · Answer 2 · 2021-07-25T18:21:31.403

1

You have to pass the pattern on which to split the string as an argument to Series.str.split(). Here is a complete reproducible example that works on Windows systems:

import pandas as pd

df = pd.DataFrame({'A': ['foo bar', 
                         'foo bar\nfoo bar',
                         'foo bar',
                         'foo bar']})

df.A.str.split(pat='\n', expand=True)

    0           1
0   foo bar     None
1   foo bar     foo bar
2   foo bar     None
3   foo bar     None

For a platform-independent solution, I would do something similar to @ThePyGuy's answer, but with str.splitlines(), because this method recognizes line boundaries from various systems (for example \n, \r\n and \r).

df.A.apply(str.splitlines).apply(pd.Series).fillna('')
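
As a rough illustration of why splitlines is more portable, the sketch below runs both approaches on a frame with mixed line endings. The frame named mixed is hypothetical test data, not something from the question.

```python
import pandas as pd

# Hypothetical sample mixing Unix (\n), Windows (\r\n) and old Mac (\r) line endings.
mixed = pd.DataFrame({"A": ["foo bar",
                            "foo bar\nfoo bar",
                            "foo bar\r\nfoo bar",
                            "foo bar\rfoo bar"]})

# Splitting on "\n" only: "\r\n" leaves a stray "\r", and a lone "\r" is not split at all.
naive = mixed["A"].str.split(pat="\n", expand=True)
print(repr(naive.iloc[2, 0]))  # 'foo bar\r'

# str.splitlines treats all three endings as line boundaries and removes them.
wide = mixed["A"].apply(str.splitlines).apply(pd.Series).fillna("")
print(wide)
```

With splitlines, every row that contains two lines ends up with "foo bar" in both columns, regardless of which line ending the text used.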

edited Jul 25 '21 at 18:21

answered Jul 25 '21 at 17:23

Arne


Hmm, it does work for me (Python 3.9.5, pandas 1.2.5). What exactly happens when you try it? – Arne Jul 25 '21 at 17:33
It doesn't split – ThePyGuy Jul 25 '21 at 17:34
I've added a complete example. Does that run this way on your system? – Arne Jul 25 '21 at 17:42
No.. It doesn't – ThePyGuy Jul 25 '21 at 17:43
I've added a platform-independent solution. – Arne Jul 25 '21 at 18:22

score 0 · Answer 3 · answered Jul 25 '21 at 17:32

You can try the following: apply Python's native str.split to the column, then apply pd.Series to create multiple columns from the result.

>>> df.A.apply(lambda x: x.split(r'\n')).apply(pd.Series).fillna('')

         0        1
0  foo bar         
1  foo bar  foo bar
2  foo bar         
3  foo bar
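
One detail worth checking with the snippet above: r'\n' is a raw string, so it matches a literal backslash followed by the letter n, not an actual newline character. The sketch below shows the difference using plain Python strings; the variable names are illustrative. This may be why the approaches behaved differently in the comments, though the thread does not confirm it.

```python
newline_text = "foo bar\nfoo bar"    # contains one real newline character
literal_text = r"foo bar\nfoo bar"   # contains a backslash followed by 'n'

print(newline_text.split("\n"))    # ['foo bar', 'foo bar']
print(newline_text.split(r"\n"))   # ['foo bar\nfoo bar'] (no split happens)
print(literal_text.split(r"\n"))   # ['foo bar', 'foo bar']
```

If the column holds real newlines, split on '\n'; if it holds the two-character sequence, the raw string r'\n' is the right separator.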

Finally, you can just rename the columns.
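
A minimal sketch of that renaming step, assuming the cells contain real newline characters and that letter names (A, B, ...) are wanted as in the question's desired output. The lettering scheme is only an illustration and covers at most 26 columns.

```python
import pandas as pd

# Sample data assumed from the question: one cell contains a real newline.
df = pd.DataFrame({"A": ["foo bar", "foo bar\nfoo bar", "foo bar", "foo bar"]})

# Split each cell into a list, then spread the lists over columns 0, 1, ...
wide = df["A"].apply(lambda x: x.split("\n")).apply(pd.Series).fillna("")

# Replace the integer column labels 0, 1, ... with letters A, B, ...
wide.columns = [chr(ord("A") + i) for i in range(wide.shape[1])]

print(wide.columns.tolist())  # ['A', 'B']
```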
