Python regular expression to get text between two strings

Question

When I read a text, some of its lines contain a string like <h3 class="heading">General Purpose</h3>. I want to extract only the value between the tags, which is General Purpose in the example above.

d = re.search(re.escape('<h3 class="heading">')+"(.*?)"+re.escape('</h3>'), str(data2))
if d:
    print(d.group(0))

Can you make your question more clear? Include data2 in your question and also mention what are you trying to extract from data2. — Mohammad Yusuf, Nov 15 '16 at 05:38
Is this an example string, or do you actually have HTML? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — OneCricketeer, Nov 15 '16 at 06:19
I think you want d.group(1). 0 is the whole matched string, 1 is the first parenthesized group. — roarsneer, Nov 15 '16 at 06:19

Mohammad Yusuf · Answer 1 · 2016-11-15T07:09:36.227

4

import re

text="""<h3 class="heading">General Purpose</h3>"""
pattern="(<.*?>)(.*)(<.*?>)"

g=re.search(pattern,text)
g.group(2)

Output:

'General Purpose'

Demo on Regex101
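
One caveat about the pattern above: the middle group (.*) is greedy, so it only gives the expected result when there is a single tag pair on the line. A minimal sketch of the difference follows, assuming a made-up line with two headings (the second heading, "Storage", is not from the question):

```python
import re

# Hypothetical input with two headings on one line (not from the question).
line = '<h3 class="heading">General Purpose</h3><h3 class="heading">Storage</h3>'

# Greedy middle group: runs to the LAST closing tag on the line.
greedy = re.search(r"(<.*?>)(.*)(<.*?>)", line)

# Non-greedy group anchored on the exact tags: one result per heading.
lazy = re.findall(r'<h3 class="heading">(.*?)</h3>', line)

print(greedy.group(2))  # General Purpose</h3><h3 class="heading">Storage
print(lazy)             # ['General Purpose', 'Storage']
```

If several matches per line are possible, re.findall with a non-greedy group is probably the safer choice.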

If it's a BeautifulSoup object, then it's even simpler to get the value. You won't need the regex.

from bs4 import BeautifulSoup

text="""<h3 class="heading">General Purpose</h3>"""
a=BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a warning
print(a.select('h3.heading')[0].text)

Output:

General Purpose
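
If bs4 is not installed, roughly the same thing can be done with the standard library's html.parser. This is only a sketch of an alternative, not what the answer above uses; the class name HeadingText and its attributes are made up for illustration:

```python
from html.parser import HTMLParser


class HeadingText(HTMLParser):
    """Collects the text of every <h3 class="heading"> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "h3" and ("class", "heading") in attrs:
            self.in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_heading = False

    def handle_data(self, data):
        # Only keep text that appears inside a matching heading.
        if self.in_heading:
            self.headings[-1] += data


parser = HeadingText()
parser.feed('<h3 class="heading">General Purpose</h3>')
parser.close()
print(parser.headings)  # ['General Purpose']
```

Note that the exact tuple comparison only matches when the class attribute is exactly "heading"; an element with several classes would need a looser check.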

edited Nov 15 '16 at 07:09

answered Nov 15 '16 at 06:28

Mohammad Yusuf

If it's already a BeautifulSoup object, then you don't need an additional regex to extract the data. You can use BeautifulSoup methods to extract the HTML data. – Mohammad Yusuf Nov 15 '16 at 07:24
@kattaprasanth: I wrote my answer before your comment that you're using BeautifulSoup. In that case, please remove the "accepted" checkmark from my answer and give it to this one because it's clearly the better one. – Tim Pietzcker Nov 15 '16 at 16:43
@TimPietzcker Actually, BeautifulSoup was returning None for that. It's working now; I'm using tbody to get the required output. Thanks once again. – kattaprasanth Nov 16 '16 at 13:55

score 1 · Accepted Answer · answered Nov 15 '16 at 06:19

Group 0 contains the entire match; you want the contents of group 1:

print(d.group(1))

But generally, using regexes to parse HTML is not such a good idea (although practically speaking, nested h3 tags should be rather uncommon).
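
To make the group numbering concrete, here is a minimal sketch using the pattern from the question. The content of data2 was never posted, so the sample string below is an assumption:

```python
import re

# Assumed sample; the real data2 from the question is unknown.
data2 = 'some text <h3 class="heading">General Purpose</h3> more text'

pattern = re.escape('<h3 class="heading">') + "(.*?)" + re.escape('</h3>')
d = re.search(pattern, data2)
if d:
    print(d.group(0))  # whole match: <h3 class="heading">General Purpose</h3>
    print(d.group(1))  # first parenthesized group: General Purpose
```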

answered Nov 15 '16 at 06:19

Tim Pietzcker

score 1 · Answer 3 · answered Nov 15 '16 at 08:04

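Warning: this relies on a lookbehind, which Python's re module supports (fixed-width only) but which JavaScript did not support at the time of writing (it was added later, in ES2018). Note that there must be no backslash before h3: \h is not a valid escape in Python's re.

(?<=<h3 class="heading">).*?(?=</h3>)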
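
A minimal sketch of the lookaround approach in Python, with the pattern written without backslash escapes (none are needed for these characters). Because the tags sit inside the lookbehind and lookahead, they are not part of the match, so group(0) is already the text between them:

```python
import re

text = '<h3 class="heading">General Purpose</h3>'

# The lookbehind must be fixed-width in Python's re; a literal tag qualifies.
pattern = r'(?<=<h3 class="heading">).*?(?=</h3>)'

m = re.search(pattern, text)
print(m.group(0) if m else None)  # General Purpose
```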

answered Nov 15 '16 at 08:04

danielpopa
