0

I want to remove the p tag that has a class from the HTML below, without affecting the other p tags that follow it.

<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>

What I am using:

def myFunction(result):
    result = re.sub(r'<article data-article-content><div class="abc"><p class="ghf">.*?</p><\/article>',' ',result)
    return result

When I call this function and print the result, 'Some Text' should be omitted. I am a beginner with regex, so any suggestions would help.

Expected Output:

Some other Text A different Text

  • Welcome to Stack Overflow! Don't use regex to parse HTML. It's a [bad idea](https://stackoverflow.com/a/1732454/8967612). But [why not?](https://stackoverflow.com/a/590789/8967612) Here are [some examples](https://stackoverflow.com/a/18724992/8967612) of problems you might run into. Use an [HTML parser](https://stackoverflow.com/q/11709079/8967612) instead. – 41686d6564 stands w. Palestine Oct 25 '21 at 19:48
  • What are you trying to achieve with the given HTML? Maybe you can explain the use case or context a bit more, so we can find a better solution than regex, because regex might restrict your solution tremendously. – hc_dev Oct 25 '21 at 19:59
  • There is no `<\/article>` so of course the regex doesn't match. Replacing through the end of `<article>` would obviously replace all the `<p>` nodes, not just the first one. Could you please [edit] to clarify what the expected result should be? – tripleee Oct 25 '21 at 20:04
  • Thank you for the suggestions. I am fetching the main body from websites, and I want to avoid scraping unwanted text from it. I am trying BeautifulSoup now. I will update the question with more info. – Raheesh Kattumunda Muhamed Oct 26 '21 at 18:53
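
To make the point from the comments concrete, here is a minimal sketch that runs the pattern from the question against the sample HTML. The variable names `pattern`, `match` and `unchanged` are only illustrative. The pattern expects the tags to sit directly next to each other, so the newlines and indentation in the real markup are already enough to stop it from matching.

```python
import re

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

# The pattern from the question: it requires the tags to be directly adjacent
# and </article> to follow the first </p> immediately.
pattern = r'<article data-article-content><div class="abc"><p class="ghf">.*?</p><\/article>'

match = re.search(pattern, html)
print(match)  # None: the whitespace between the tags prevents a match

unchanged = re.sub(pattern, ' ', html)
print(unchanged == html)  # True: re.sub found nothing to replace
```

Loosening the pattern to tolerate whitespace would still leave it tied to this exact nesting, which is why a parser is the more robust route.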

2 Answers

1

Use BeautifulSoup. It's a fantastic HTML parser and it has a very intuitive API. I've used it hundreds of times for large and small projects.

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

from bs4 import BeautifulSoup

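# Name the parser explicitly; without it bs4 guesses one and emits a warning.
# 'lxml' (pip install lxml) produces the <html>/<body> wrapper shown in the output below.
soup = BeautifulSoup(html, 'lxml')

ps = soup.find_all('p')

# First <p> that carries a class attribute (raises IndexError if there is none).
p_with_class = [p for p in ps if p.get('class') is not None][0]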

print(p_with_class)
# <p class="ghf">Some Text</p>

# Remove it.
p_with_class.decompose()

print(soup.prettify())

Output:

<html>
 <body>
  <article data-article-content="">
   <div class="abc">
    <p>
     Some other Text
    </p>
    <p>
     A different Text
    </p>
   </div>
  </article>
 </body>
</html>

More here.
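
If what you need is the plain text from the question's expected output rather than the cleaned HTML, something along these lines might work. It is only a sketch: `remaining_text` is a made-up helper name, and it assumes you want to drop every `<p>` that has any class attribute, not just the first one.

```python
from bs4 import BeautifulSoup

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''


def remaining_text(markup):
    """Drop every <p> that has a class attribute and return the leftover text."""
    soup = BeautifulSoup(markup, 'html.parser')
    # class_=True matches any <p> that carries a class attribute, whatever its value
    for p in soup.find_all('p', class_=True):
        p.decompose()
    # stripped_strings skips whitespace-only text nodes and trims the rest
    return ' '.join(soup.stripped_strings)


print(remaining_text(html))
# Some other Text A different Text
```

Using `html.parser` here avoids the extra lxml dependency; since only the text is returned, the missing `<html>`/`<body>` wrapper makes no difference.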

0

With BeautifulSoup you could transform the given HTML so that

  • any <p> tag with class ghf is removed

Input

<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>

Expected Output

<article data-article-content="">
<div class="abc">

<p>Some other Text</p>
<p>A different Text</p>
</div>
</article>

With BeautifulSoup

This uses BeautifulSoup version 4, also known by the acronym bs4.

Install using pip:

pip install beautifulsoup4

Then parse, find, modify and print:

from bs4 import BeautifulSoup

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

soup = BeautifulSoup(html, features='html.parser')  # parses HTML using Python's built-in HTML parser

# find_all returns every match, so any <p class="ghf"> is removed, not only the first one
for paragraph in soup.find_all("p", {"class": "ghf"}):
    paragraph.extract()  # removes the tag and leaves an empty line behind

print(soup)  # unfortunately indentation is lost

You could use prettify() on soup to get some indentation back.
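
As a possible variation, the same removal can be written with a CSS selector and followed by prettify(). This is just a sketch under the assumption that the class name is always ghf; the `remaining` list is only there to check what is left.

```python
from bs4 import BeautifulSoup

html = '''
<article data-article-content>
    <div class="abc">
       <p class="ghf">Some Text</p>
       <p>Some other Text</p>
       <p>A different Text</p>
    </div>
</article>
'''

soup = BeautifulSoup(html, features='html.parser')

# select() takes a CSS selector; "p.ghf" means <p> elements whose class list contains "ghf"
for tag in soup.select('p.ghf'):
    tag.extract()

# Text of the paragraphs that survived the removal
remaining = [p.get_text() for p in soup.find_all('p')]
print(remaining)

# prettify() re-indents the tree, one tag or text node per line
print(soup.prettify())
```

The selector form is handy when the condition grows, for example `div.abc > p.ghf` to restrict the removal to direct children of that div.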

See also

The functions used are explained in detail in related questions & answers:

hc_dev