How do I remove the <p> tag while cleaning RSS XML data?

Asked Mar 31 '23 at 21:53

Active Mar 31 '23 at 22:04

Viewed 20 times

I am cleaning RSS feed data that I pulled using feedparser. I managed to remove all special characters, but I am unable to remove the "p" left over from the <p> tag. How can I remove it?

I tried this code:

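import re

def clean_text(text):
    # Lowercase each whitespace-separated token and drop every character
    # that is not a-z or 0-9.
    return [re.sub('[^a-z0-9]', '', w.lower()) for w in text.strip().split()]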


news_df['clean_body'] = news_df['summary'].apply(clean_text)

The code runs without errors, but the <p> tag is not fully removed: the angle brackets are stripped, while the letter p remains attached to the neighbouring word.
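
To illustrate the behaviour, here is a small reproduction. The sample string is made up for demonstration and is not taken from the actual feed; real summaries will differ.

```python
import re

def clean_text(text):
    # Same logic as in the question: lowercase each token and drop
    # every character that is not a-z or 0-9.
    return [re.sub('[^a-z0-9]', '', w.lower()) for w in text.strip().split()]

# Hypothetical sample summary (an assumption, not real feed data).
sample = "<p>Hello, world!</p>"
tokens = clean_text(sample)
print(tokens)  # ['phello', 'worldp']
```

The character class removes "<", ">" and "/" but keeps the tag name, so the "p" ends up glued to the first and last words.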

edited Mar 31 '23 at 22:04

Marcelo Paco


asked Mar 31 '23 at 21:53

Libra

First, [never use regex with HTML (or XML)](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454); use an HTML parser. Second, please edit your question and add a short, representative sample of your HTML, before and after. – Jack Fleeting Apr 01 '23 at 11:18
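
A minimal sketch of the parser-based approach the comment points to, assuming the summaries contain ordinary HTML fragments. It uses Python's standard-library html.parser to drop the tags before the character cleanup runs. The names _TextExtractor and strip_tags, and the sample string, are hypothetical and chosen only for illustration; a third-party parser such as BeautifulSoup could be used in the same way.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores all tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.parts.append(data)

def strip_tags(markup):
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Join with a space so text from adjacent elements does not merge.
    return " ".join(parser.parts)

def clean_text(text):
    words = strip_tags(text).strip().split()
    cleaned = [re.sub('[^a-z0-9]', '', w.lower()) for w in words]
    # Assumption: tokens that end up empty (e.g. a lone "-") are not wanted.
    return [w for w in cleaned if w]

# Hypothetical sample summary (an assumption, not real feed data).
sample = "<p>Hello, world!</p>"
print(clean_text(sample))  # ['hello', 'world']
```

With this version of clean_text, the original call news_df['clean_body'] = news_df['summary'].apply(clean_text) can stay unchanged. One trade-off of joining with a space: a word split by inline markup, such as foo<b>bar</b>, would become two tokens.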


0 Answers
