4

I have a pandas dataframe in which a column contains paragraphs of text. I want to expand the dataframe into separate columns by splitting the paragraphs at newlines. A paragraph may contain multiple newlines.

Example dataframe:

Current output:
A
foo bar
foo bar\nfoo bar
foo bar
foo bar

Desired output:

   A         B                                                      
0 foo bar                                                  
1 foo bar   foo bar                                                 
2 foo bar                                                  
3 foo bar                                                  

I have tried using this:

df.A.str.split(expand=True)

But it splits at every whitespace, not at "\n" as expected.

Alex
Adam Choy

3 Answers

2

As stated in the docs, you can specify the delimiter to split on through the optional pat parameter of the split method; otherwise it splits on whitespace:

"String or regular expression to split on. If not specified, split on whitespace."

Therefore, you can do the following to split on newlines:

df.A.str.split(pat="\n", expand=True)
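To make the difference concrete, here is a minimal sketch comparing the default behaviour with pat="\n". The sample frame is an assumption modelled on the question, not the asker's real data.

```python
import pandas as pd

# Sample data assumed from the question: one row contains a newline.
df = pd.DataFrame({"A": ["foo bar", "foo bar\nfoo bar"]})

# Default: any whitespace (spaces and newlines alike) acts as a separator.
by_whitespace = df["A"].str.split(expand=True)

# With pat="\n": only the newline character acts as a separator.
by_newline = df["A"].str.split(pat="\n", expand=True)

print(by_whitespace.shape)  # (2, 4): every word ends up in its own column
print(by_newline.shape)     # (2, 2): one column per line of text
```

Rows with fewer lines than the longest row are padded with missing values, which you can replace with fillna('') if you prefer empty strings.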
Drumstick
1

You have to pass the pattern on which to split the string as an argument to Series.str.split(). Here is a complete reproducible example that works on Windows systems:

import pandas as pd

df = pd.DataFrame({'A': ['foo bar', 
                         'foo bar\nfoo bar',
                         'foo bar',
                         'foo bar']})

df.A.str.split(pat='\n', expand=True)
    0           1
0   foo bar     None
1   foo bar     foo bar
2   foo bar     None
3   foo bar     None

For a platform-independent solution, I would do something similar to @ThePyGuy's answer, but with str.splitlines(), because this method will recognize line boundaries from various systems.

df.A.apply(str.splitlines).apply(pd.Series).fillna('')
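As a rough illustration of why this matters, the sketch below uses hypothetical data that mixes Windows-style (\r\n) and Unix-style (\n) line endings. Splitting on "\n" alone leaves a stray "\r" behind, while str.splitlines() does not.

```python
import pandas as pd

# Hypothetical data mixing \r\n and \n line endings (not from the question).
df = pd.DataFrame({"A": ["foo bar\r\nfoo bar", "foo bar\nfoo bar", "foo bar"]})

# Splitting on "\n" only: the first piece of row 0 keeps a trailing "\r".
naive = df["A"].str.split(pat="\n", expand=True)

# str.splitlines() recognises \r\n, \n and \r as line boundaries.
robust = df["A"].apply(str.splitlines).apply(pd.Series).fillna("")

print(repr(naive.iloc[0, 0]))   # 'foo bar\r'
print(repr(robust.iloc[0, 0]))  # 'foo bar'
```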
Arne
0

You can try the following: apply Python's native str.split to the column, then apply pd.Series to create multiple columns from the result.

>>> df.A.apply(lambda x: x.split('\n')).apply(pd.Series).fillna('')

         0        1
0  foo bar         
1  foo bar  foo bar
2  foo bar         
3  foo bar         

Finally, you can just rename the columns.
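One possible way to do the renaming is sketched below. The letter names A, B, ... follow the desired output in the question; generating them from string.ascii_uppercase is my own assumption and only works for up to 26 columns.

```python
import string

import pandas as pd

# Sample data assumed from the question.
df = pd.DataFrame({"A": ["foo bar", "foo bar\nfoo bar"]})

# Split on newlines and replace missing pieces with empty strings.
parts = df["A"].apply(lambda x: x.split("\n")).apply(pd.Series).fillna("")

# Rename 0, 1, ... to A, B, ... (assumes at most 26 resulting columns).
parts.columns = list(string.ascii_uppercase[: parts.shape[1]])

print(parts)
```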

ThePyGuy