Extracting an HTML tag from a string

Question

Can anyone please explain how I would extract a substring from an input string?

Input:

'<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'

Output:

'<h3>freedom machines.</h3>'

I am trying to do it with regex, but no luck. Do you have any suggestions?

I need to check whether my string starts with any header tag (<h1>, <h2> or <h3>), and if it does, extract that header tag.

I tried with startswith, but with no success:

if input.startswith("<h"):
    ...  # code to extract that h tag goes here

What's the pattern you're trying to extract? Please be more specific. – sharath, Jul 04 '17 at 13:36
Use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) or [ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html) to parse (X)HTML. [Never regex](https://stackoverflow.com/a/1732454/4954037). – hiro protagonist, Jul 04 '17 at 13:36
I am already using BeautifulSoup, but my requirement is a little different. That's why I had to convert the soup output to a str. – Aman Saraf, Jul 04 '17 at 13:38

Aaron · Answer 1 · 2017-07-04T13:40:42.330

2

You can use re.search to extract the <h3> element, tags included.

The <h3>.*?</h3> pattern matches an opening <h3>, then as few characters as possible (.*? is non-greedy), then the first closing </h3>.

>>> import re
>>> text = '<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'
>>> # re.DOTALL lets "." match newlines, so a header spanning several lines still matches
>>> match = re.search("<h3>.*?</h3>", text, re.IGNORECASE | re.DOTALL)
>>> print(match.group())
<h3>freedom machines.</h3>
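
If the header level is not known in advance and the header has to be at the very start of the string, something along these lines might work. This is only a sketch: the helper name leading_header is made up for illustration, the pattern assumes the opening tag carries no attributes, and it only covers h1 to h3 as in the question. re.match anchors at the start of the string, and the backreference \1 makes the closing tag agree with the opening one.

```python
import re

# Matches <h1>, <h2> or <h3> and requires the same tag name in the closing tag (\1).
# Assumption: the opening tag has no attributes, as in the question's input.
HEADER_PATTERN = re.compile(r"<(h[1-3])>.*?</\1>", re.IGNORECASE | re.DOTALL)


def leading_header(text):
    """Return the header element the string starts with, or None.

    `leading_header` is a hypothetical helper name, not a library function.
    """
    # re.match only succeeds at position 0, which doubles as the "starts with" check.
    match = HEADER_PATTERN.match(text)
    return match.group(0) if match else None


print(leading_header('<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'))
print(leading_header('<p>dom.</p><h3>freedom machines.</h3>'))
```

The first call prints <h3>freedom machines.</h3> and the second prints None, because the header is not at the start of that string.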

edited Jul 04 '17 at 13:40

answered Jul 04 '17 at 13:37

Aaron


The OP needs to check whether his string starts with *any* header tag – Thierry Lathuille Jul 04 '17 at 13:39
Hey, thanks @Aaron. Suppose I don't know what type of header tag is used. It can be h1, h2 or h3. What could be done in that case? – Aman Saraf Jul 04 '17 at 13:39
@user3476378 Then try using `<h\d>.*?</h\d>`; the `\d` means a digit. – Aaron Jul 04 '17 at 13:42
@Aaron Thanks. It solved my issue. Thanks again. – Aman Saraf Jul 04 '17 at 13:48
That is unfortunately not scalable to a whole html document, but this is rather neat. I'm always impressed by regular expressions... – Right leg Jul 04 '17 at 13:51

Right leg · Answer 2 · 2017-07-05T08:03:41.937

1

With BeautifulSoup:

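from bs4 import BeautifulSoup

html = '<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a warning
text = soup.find("h3").string  # text inside the first <h3>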

This is a basic use of BeautifulSoup. Create a BeautifulSoup object with your string as parameter. Then use its find method to find the tag with the name you're looking for. Finally, get the text the tag surrounds with its string attribute.

If you know that your text is in a <h1>, <h2> or <h3> but you don't know which, just try all of them. You can even check the three at once:

tag = soup.find("h1") or soup.find("h2") or soup.find("h3")
text = tag.string

The or operator returns its first operand that is truthy. In this case, that means the first soup.find result that is not None (if all three are None, tag is None and tag.string raises an AttributeError). The find method also accepts an iterable of names, so you can pass it a static tuple or a list. The result is the first tag in document order whose name is any of those given, or None if there is no such tag.

tag = soup.find(("h1", "h2", "h3"))

Of course, it is better to know in advance exactly which tag contains what you want. If there are both <h1> and <h2> tags on the page, you won't know which one to pick.
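
Since the question asks for the header element itself (not just its text) and only when the string starts with it, here is a rough sketch of how that could look with BeautifulSoup. It assumes the bs4 package is installed and uses the standard library's "html.parser". The names HEADER_NAMES and leading_header_markup are invented for this example. Note that leading whitespace or text before the header counts as "not starting with a header" here.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Header levels mentioned in the question; extend the tuple if more are needed.
HEADER_NAMES = ("h1", "h2", "h3")


def leading_header_markup(html):
    """Return the markup of the header the string starts with, or None.

    `leading_header_markup` is a hypothetical helper, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # soup.contents lists the top-level nodes in document order.
    first = soup.contents[0] if soup.contents else None
    if isinstance(first, Tag) and first.name in HEADER_NAMES:
        # str(tag) serializes the whole element, tags included,
        # whereas tag.string would give only the enclosed text.
        return str(first)
    return None


html = '<h3>freedom machines.</h3><p>dom.</p><br/><p>The robust display.</p>'
print(leading_header_markup(html))
print(leading_header_markup('<p>dom.</p><h3>later header</h3>'))
```

For the question's input this should print <h3>freedom machines.</h3>, and None for the second string, where the first element is a <p>.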

edited Jul 05 '17 at 08:03

answered Jul 04 '17 at 13:39

Right leg


Already did this. But I need to extract only the header tag, and I may not know which tag is there. It can be h3, h2 or h1. – Aman Saraf Jul 04 '17 at 13:41
@user3476378 That detail needs to be in your question. Anyway, editing my answer. – Right leg Jul 04 '17 at 13:41
Did that. Thanks for letting me know! – Aman Saraf Jul 04 '17 at 13:42
You can use a list in `find`, e.g.: `tag = soup.find(["h1", "h2", "h3"])` – t.m.adam Jul 05 '17 at 05:53
@t.m.adam If you really want to do so, use a tuple instead. You don't need a list because it's static, and a tuple is smaller in memory. It's almost nothing, but better style-wise. – Right leg Jul 05 '17 at 07:59
Yes, you can use any iterable, it's up to you. – t.m.adam Jul 05 '17 at 08:05
