I am cleaning RSS feed data that I pulled using feedparser. I managed to remove all special characters, but I am unable to remove the "p" left over from the <p> tag. How can I remove it?

I tried this code:

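import re

def clean_text(text):
    # Lowercase each whitespace-separated word and drop every character that is not a-z or 0-9.
    return [re.sub('[^a-z0-9]', '', w.lower()) for w in text.strip().split()]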


news_df['clean_body'] = news_df['summary'].apply(clean_text)

The code runs without errors, but the <p> tag is not fully removed: the angle brackets are stripped while the letter "p" remains.

Marcelo Paco
Libra
First, [never use regex with html (or xml)](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454); use an html parser. Second, please edit your question and add a short, representative sample of your html, before and after. – Jack Fleeting Apr 01 '23 at 11:18
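
A minimal sketch of what the comment suggests, assuming the `summary` values are HTML fragments such as `<p>Some text</p>` (no sample was posted, so that is a guess). The original regex deletes `<` and `>` but keeps the letter between them, which is why a stray "p" ends up glued to the first and last words. Here the tags are dropped with the standard library's `html.parser` before the character filter runs. The names `_TextExtractor` and `strip_tags` are made up for illustration; a third-party parser such as BeautifulSoup could fill the same role.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores tags such as <p>."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; the tags themselves never reach here.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment, with tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so words on either side of a tag do not run together.
    return " ".join(parser.parts)


def clean_text(text):
    """Strip tags first, then apply the same a-z/0-9 filter as in the question."""
    words = strip_tags(text).lower().split()
    cleaned = (re.sub(r'[^a-z0-9]', '', w) for w in words)
    # Unlike the original, drop words that become empty after filtering.
    return [w for w in cleaned if w]
```

It would be applied the same way as in the question, with `news_df['summary'].apply(clean_text)`. Whether empty strings should be dropped is a design choice; remove the final filter to keep the original behaviour.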
