I have a pandas DataFrame with two columns: tags, which contains numbers, and elements, where each cell is a list of strings.

DataFrame:

import pandas as pd

df = pd.DataFrame({
    'tags': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
    'elements': {
        0: ['\n☒', '\nANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 '],
        1: ['', ''],
        2: ['\n', '\nFor the Fiscal Year Ended June 30, 2020'],
        3: ['\n', '\n'],
        4: ['\n', '\nOR']
    }
})

I am trying to remove all instances of \n from every string in the lists in the elements column, but I'm really struggling to do so. My solution was to use a nested loop and re.sub() to try to replace them, but it has done nothing (granted, this is a horrible solution). This was my attempt:


for ls in range(len(page_table.elements)):
    for st in range(len(page_table.elements[i])):
        page_table.elements[i][st] = re.sub('\n', '', page_table.elements[i][st])

Is there a way to do this?

Alex
geds133
  • Please read this: [how-to-make-good-reproducible-pandas-examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) e.g. use `df[['tags', 'elements']].iloc[:5].to_dict()` – Andreas Aug 25 '21 at 14:32
  • You'll need to [`explode`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.explode.html) (maybe many times, depending on nesting) and then [`replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) the `\n` values. But you should update this to be a better example that can be copied and pasted easily. – Alex Aug 25 '21 at 14:34
  • @Andreas please find the updated output – geds133 Aug 25 '21 at 14:38
  • @geds133, thank you. Unfortunately the sample data is not copy-pastable, because the HTML is not in quotes, so it is difficult to reproduce your problem. I would suggest that you (manually?) create simplified, copy-pastable sample data as input, along with the expected output for it. Otherwise we need more time to prepare the sample data than to solve the question itself. – Andreas Aug 25 '21 at 14:42
  • @geds133 that dict doesn't match the lists that were in the `tags` column in your first revision – Alex Aug 25 '21 at 14:42
  • @Andreas the column `tags` is not the important column here and so I have dropped it from the output. The focus here is cleaning the lists in elements. – geds133 Aug 25 '21 at 14:50
  • @Alex tags here could be anything. Take an example of a series with a single integer in each row. The focus here is on `elements` – geds133 Aug 25 '21 at 14:52
  • i've hopefully added the construction of the df to the bottom of the post @geds133 pls check it. – MDR Aug 25 '21 at 14:52
  • @Andreas I have updated the output to be a much simpler version. Hope this helps – geds133 Aug 25 '21 at 15:01
  • @Alex I have updated the output to be a much simpler version. Hope this helps – geds133 Aug 25 '21 at 15:01

2 Answers


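You can explode the lists into one row per element, replace the \n values, and then group by the original index to rebuild the lists.
You can leave out the .groupby(level=0).agg(list) if you don't need the values put back into lists, though the result will then have one row per list element and so a different shape from the original DataFrame.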

df["elements"] = (
    df["elements"]
    .explode()
    .str.replace(r"\n", "", regex=True)
    .groupby(level=0)
    .agg(list)
)

Which outputs:

0    [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1                                                 [, ]
2          [, For the Fiscal Year Ended June 30, 2020]
3                                                 [, ]
4                                               [, OR]
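To make the shape difference concrete, here is a minimal sketch on a shortened copy of the sample data (the three-row frame and the names flat and regrouped are only for illustration). It uses a plain-string replacement with regex=False, which should behave the same as the regex version for a literal newline.

```python
import pandas as pd

# Shortened, hypothetical copy of the question's data.
df = pd.DataFrame({
    "tags": [1, 1, 1],
    "elements": [["\n☒", "\nANNUAL REPORT"], ["", ""], ["\n", "\nOR"]],
})

# One row per list element; the original row label is repeated in the index.
flat = df["elements"].explode().str.replace("\n", "", regex=False)

# Grouping on the index level rebuilds one list per original row.
regrouped = flat.groupby(level=0).agg(list)

print(flat.shape)       # (6,)
print(regrouped.shape)  # (3,)
print(regrouped.tolist())
```

Without the groupby step, flat carries a repeated index (0, 0, 1, 1, 2, 2), so it no longer lines up one-to-one with the rows of df.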
Alex

A list comprehension inside map also works:

df['elements'] = df['elements'].map(lambda x: [y.replace('\n', '') for y in x])

Which outputs:

0    [☒, ANNUAL REPORT PURSUANT TO SECTION 13 OR 15...
1                                                 [, ]
2          [, For the Fiscal Year Ended June 30, 2020]
3                                                 [, ]
4                                               [, OR]
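If the column might also contain empty lists or missing cells, which the sample data does not, the same idea can be wrapped in a small helper. This is only a sketch: strip_newlines is a hypothetical name, and passing non-list cells through unchanged is an assumption about what you would want. One reason to consider it is that explode turns an empty list into NaN, whereas the per-cell approach leaves an empty list as an empty list.

```python
import pandas as pd

def strip_newlines(cell):
    """Remove newline characters from each string in a list cell."""
    # Assumption: cells that are not lists (e.g. None/NaN) are returned unchanged.
    if not isinstance(cell, list):
        return cell
    # Non-string items inside a list are also left as they are.
    return [s.replace("\n", "") if isinstance(s, str) else s for s in cell]

# Hypothetical data with an empty list and a missing cell.
df = pd.DataFrame({
    "tags": [1, 1, 1],
    "elements": [["\n☒", "\nOR"], [], None],
})

df["elements"] = df["elements"].map(strip_newlines)
print(df["elements"].tolist())
```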
Andreas