Remove newline characters from pandas series of lists

Question

I have a pandas DataFrame with two columns: tags, which contains numbers, and elements, which contains lists of strings.

Dataframe:

import pandas as pd

df = pd.DataFrame({
    'tags': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'elements': {
        0: ['\n☒', '\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 '],
        1: ['', ''],
        2: ['\n', '\nFor the Fiscal Year Ended June 30, 2020'],
        3: ['\n', '\n'],
        4: ['\n', '\nOR']
    }
})

I am trying to remove all instances of \n from every element of every list in the elements column, but I'm really struggling to do so. My solution was to use a nested loop and re.sub() to try to replace them, but it has done nothing (granted, this is a horrible solution). This was my attempt:


for ls in range(len(page_table.elements)):
    for st in range(len(page_table.elements[i])):
        page_table.elements[i][st] = re.sub('\n', '', page_table.elements[i][st])

Is there a way to do this?

Please read this: [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) e.g. use `df[['tags', 'elements']].iloc[:5].to_dict()` — Andreas, Aug 25 '21 at 14:32
You'll need to [`explode`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) (maybe many times, depending on nesting) and then [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the `\n` values. But you should update this to be a better example that can be copied and pasted easily. — Alex, Aug 25 '21 at 14:34
@geds133, thank you, but unfortunately the sample data is not copy-pastable because the HTML is not in quotes, so it is difficult to reproduce your problem. I would suggest that you (manually?) create simplified, copy-pastable sample data as input, along with the expected output for it. Otherwise we need more time to prepare sample data than to solve the question itself. — Andreas, Aug 25 '21 at 14:42
@geds133 that dict doesn't match the lists that were in the `tags` column in your first revision — Alex, Aug 25 '21 at 14:42
@Andreas the column `tags` is not the important column here and so I have dropped it from the output. The focus here is cleaning the lists in elements. — geds133, Aug 25 '21 at 14:50
@Alex tags here could be anything. Take an example of a series with a single integer in each row. The focus here is on `elements` — geds133, Aug 25 '21 at 14:52
I've hopefully added the construction of the df to the bottom of the post, @geds133; please check it. — MDR, Aug 25 '21 at 14:52
@Andreas I have updated the output to be a much simpler version. Hope this helps — geds133, Aug 25 '21 at 15:01
@Alex I have updated the output to be a much simpler version. Hope this helps — geds133, Aug 25 '21 at 15:01

score 1 · Accepted Answer · answered Aug 25 '21 at 15:03

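You can explode the lists so that each element gets its own row, and then replace the \n values.
The final .groupby(level=0).agg(list) puts the cleaned strings back into one list per original row. You can leave it out if you don't need lists, though the result will then have one row per element and so a different shape from the original DataFrame.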

df["elements"] = (
    df["elements"]
    .explode()
    .str.replace(r"\n", "", regex=True)
    .groupby(level=0)
    .agg(list)
)

Which outputs:

0    [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1                                                 [, ]
2          [, For the Fiscal Year Ended June 30, 2020]
3                                                 [, ]
4                                               [, OR]
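To see what each step contributes, here is a minimal sketch on a hypothetical two-row Series. The sample values and the names s, exploded, cleaned and regrouped are made up for illustration and are not from the question. It uses a literal replacement (regex=False), which should behave the same as the regex version for a plain newline.

```python
import pandas as pd

# Hypothetical sample, smaller than the DataFrame in the question.
s = pd.Series([["\nA", "\nB"], ["\n", "\nOR"]])

# One row per list element; the original index label is repeated.
exploded = s.explode()

# Literal (non-regex) removal of the newline character.
cleaned = exploded.str.replace("\n", "", regex=False)

# Group by the repeated index label to rebuild one list per original row.
regrouped = cleaned.groupby(level=0).agg(list)

print(exploded.index.tolist())  # [0, 0, 1, 1]
print(regrouped.tolist())       # [['A', 'B'], ['', 'OR']]
```

The repeated index labels in the exploded Series are what let groupby(level=0) reassemble the lists in their original rows.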

score 1 · Answer 2 · answered Aug 25 '21 at 15:08

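Also possible, using a list comprehension inside map so that each list is cleaned in place of the original:

df['elements'] = df['elements'].map(lambda x: [y.replace('\n', '') for y in x])

Which outputs:
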
0    [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1                                                 [, ]
2          [, For the Fiscal Year Ended June 30, 2020]
3                                                 [, ]
4                                               [, OR]
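If the lambda feels dense, the same idea could be written with a named helper. This is only a sketch, assuming every cell holds a list of strings; strip_newlines and the sample Series are hypothetical names for illustration. It also shows one difference from the explode approach: an empty list stays empty here, whereas explode turns an empty list into a NaN row, so regrouping gives back a list containing NaN.

```python
import pandas as pd


def strip_newlines(items):
    # Build a new list with every newline character removed from each string.
    return [item.replace("\n", "") for item in items]


# Hypothetical sample: one ordinary list and one empty list.
s = pd.Series([["\nA", "\n"], []])

mapped = s.map(strip_newlines)

# The explode-based round trip, for comparison on the same data.
round_trip = (
    s.explode()
    .str.replace("\n", "", regex=False)
    .groupby(level=0)
    .agg(list)
)

print(mapped.tolist())      # [['A', ''], []]
print(round_trip.tolist())  # the second row holds a single NaN instead of being empty
```

Which behaviour you want for empty lists is a judgment call; if the column never contains them, both approaches should give the same result.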
