Can anyone please explain how I would extract a substring from an input string?

Input:

'<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'

Output :

'<h3>freedom machines.</h3>'

I am trying to do it with regex, but no luck. Do you have any suggestions?

I need to check whether my string starts with a header tag (<h1>, <h2> or <h3>), and if it does, extract that header element.

I tried with startswith, but with no success:

if input.startswith("<h"):
    pass  # TODO: code to extract that h tag
Right leg
Aman Saraf
  • What's the pattern you're trying to extract? Please be more specific. – sharath Jul 04 '17 at 13:36
  • use [beautiful soup](https://www.crummy.com/software/BeautifulSoup/) or [elementtree](https://docs.python.org/3/library/xml.etree.elementtree.html) to parse (x)html. [never regex](https://stackoverflow.com/a/1732454/4954037). – hiro protagonist Jul 04 '17 at 13:36
  • I am already using BeautifulSoup, but my requirement is a little different. That's why I had to convert the soup output to a str. – Aman Saraf Jul 04 '17 at 13:38

2 Answers

You can use re.search to extract the first <h3> element, tags included.

The pattern <h3>.*?</h3> matches an opening <h3> tag, then as few characters as possible (the ? makes .* non-greedy), up to the first closing </h3>.

>>> import re
>>> text = '<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'
>>> # IGNORECASE also matches <H3>; DOTALL lets . span newlines inside the header
>>> match = re.search("<h3>.*?</h3>", text, re.IGNORECASE | re.DOTALL)
>>> print(match.group())
<h3>freedom machines.</h3>
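
The question also asks to check that the string starts with a header and to accept <h1>, <h2> or <h3>. Here is a minimal sketch of one way to do that with a regex, assuming the header is not nested inside another element. The names HEADER_RE and leading_header_regex are made up for this example. re.match only matches at the beginning of the string, and the \1 backreference makes the closing tag use the same level as the opening one.

```python
import re

# An <h1>, <h2> or <h3> element; \1 repeats the digit captured in the opening tag.
# [^>]* tolerates attributes such as class="title" on the opening tag.
HEADER_RE = re.compile(r"<h([1-3])\b[^>]*>.*?</h\1\s*>", re.IGNORECASE | re.DOTALL)

def leading_header_regex(html):
    """Return the header element the string starts with, or None."""
    # re.match (unlike re.search) only succeeds at position 0 of the string.
    match = HEADER_RE.match(html)
    return match.group(0) if match else None

text = '<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'
print(leading_header_regex(text))                         # <h3>freedom machines.</h3>
print(leading_header_regex('<p>intro</p><h3>late</h3>'))  # None
```

This is still regex on HTML, so it will trip over things like a literal </h3> inside a comment or attribute. For input you don't control, a parser is the safer choice.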
Aaron

With BeautifulSoup:

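from bs4 import BeautifulSoup

html = '<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
text = soup.find("h3").string  # 'freedom machines.'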

This is basic BeautifulSoup usage. Create a BeautifulSoup object from your string, use its find method to locate the first tag with the name you're looking for, and read the text inside that tag through its string attribute. If you need the whole element with its tags, as in the expected output, use str(tag) instead.

If you know that your text is in a <h1>, <h2> or <h3> but you don't know which, just try all of them. You can even check the three at once:

tag = soup.find("h1") or soup.find("h2") or soup.find("h3")
text = tag.string

The or operator returns its first operand that is truthy. Here, that is the first soup.find result that is not None, so an <h1> is preferred over an <h2>, and an <h2> over an <h3>. The find method also accepts several tag names at once, so you can pass it a static tuple or list. The result is then the first tag in document order (if any) that matches any of the given names, regardless of its level.

tag = soup.find(("h1", "h2", "h3"))

Of course, it is better to know in advance which tag holds what you want. If the page contains both an <h1> and an <h2>, you need a rule for choosing between them.
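
Since the question is specifically about a string that starts with a header, here is a rough sketch that checks the first top-level node instead of searching the whole document. It assumes the bs4 package with the built-in "html.parser" backend. HEADER_TAGS and leading_header_soup are names invented for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HEADER_TAGS = ("h1", "h2", "h3")

def leading_header_soup(html):
    """Return the markup of the leading h1/h2/h3 element, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Walk the top-level nodes and decide at the first one that is not blank text.
    for node in soup.contents:
        if isinstance(node, Tag):
            # str(tag) gives the element with its tags; tag.string gives only the text.
            return str(node) if node.name in HEADER_TAGS else None
        if str(node).strip():
            return None  # non-blank text comes before any tag
    return None

html = '<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'
print(leading_header_soup(html))  # <h3>freedom machines.</h3>
```

Returning None when the first element is not a header keeps the "starts with" requirement explicit, rather than silently picking up a header further down the page.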

Right leg