2

When I read a text, some of its lines contain a string like <h3 class="heading">General Purpose</h3>. I want to extract only the value, General Purpose, from such a line.

d = re.search(re.escape('<h3 class="heading">')+"(.*?)"+re.escape('</h3>'), str(data2))
if d:
    print(d.group(0))
Tim Pietzcker
kattaprasanth

3 Answers

4
import re

text="""<h3 class="heading">General Purpose</h3>"""
pattern="(<.*?>)(.*)(<.*?>)"

g=re.search(pattern,text)
g.group(2)

Output:

'General Purpose'


If it's a BeautifulSoup object, then it's even simpler to get the value. You won't need the regex.

from bs4 import BeautifulSoup

text="""<h3 class="heading">General Purpose</h3>"""
a=BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid a warning
print(a.select('h3.heading')[0].text)

Output:

General Purpose
Mohammad Yusuf
  • If it's already a BeautifulSoup object, then you don't have to use an additional regex to extract the data. You can use BeautifulSoup methods to extract the HTML data. – Mohammad Yusuf Nov 15 '16 at 07:24
  • @kattaprasanth: I wrote my answer before your comment that you're using BeautifulSoup. In that case, please remove the "accepted" checkmark from my answer and give it to this one because it's clearly the better one. – Tim Pietzcker Nov 15 '16 at 16:43
  • @TimPietzcker Actually, BeautifulSoup was returning None for that. Now it's working; I am using tbody to get the required output. Thanks once again. – kattaprasanth Nov 16 '16 at 13:55
1

Group 0 contains the entire match; you want the contents of group 1:

print(d.group(1))
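
To make the difference between the two groups concrete, here is a minimal sketch that reuses the pattern from the question. The sample `line` and the names `whole_match` and `inner_text` are assumptions made for illustration only.

```python
import re

# Sample input; the real data would come from the text being read.
line = '<h3 class="heading">General Purpose</h3>'

# Same pattern as in the question: literal tags around one capturing group.
pattern = re.escape('<h3 class="heading">') + "(.*?)" + re.escape('</h3>')

d = re.search(pattern, line)
if d:
    whole_match = d.group(0)  # the entire matched text, tags included
    inner_text = d.group(1)   # only what the parentheses captured
    print(whole_match)
    print(inner_text)
```

Here `group(0)` returns the full `<h3 ...>...</h3>` string, while `group(1)` returns only the text between the tags.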

But generally, using regexes to parse HTML is not such a good idea (although practically speaking, nested h3 tags should be rather uncommon).
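
If a third-party parser is not available, one possible alternative to a regex is the standard library's `html.parser` module. The following is only a sketch: the class name `HeadingCollector` and the sample markup are hypothetical, and it assumes the `class` attribute is exactly `heading`.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collects the text of every <h3 class="heading"> element."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; this check only matches
        # an attribute value that is exactly "heading".
        if tag == "h3" and ("class", "heading") in attrs:
            self._inside = True
            self.headings.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching <h3>.
        if self._inside:
            self.headings[-1] += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self._inside = False


parser = HeadingCollector()
parser.feed('<p>intro</p><h3 class="heading">General Purpose</h3>')
parser.close()
headings = parser.headings
print(headings)
```

Because the parser tracks tags rather than matching characters, it does not depend on the exact spacing or attribute quoting of the `<h3>` tag.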

Tim Pietzcker
1

Warning: this relies on lookbehind. It works in Python, but JavaScript engines older than ES2018 do not support lookbehind.

(?<=<h3 class="heading">).*?(?=</h3>)
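
A short sketch of how a lookaround pattern of this shape could be used from Python. It writes the opening tag literally, since Python 3's `re` module rejects `\h` as an unknown escape; the sample `line` and the variable names are assumptions for illustration.

```python
import re

line = '<h3 class="heading">General Purpose</h3>'

# The lookbehind and lookahead assert the surrounding tags without
# consuming them, so the match itself is only the inner text.
lookaround = re.compile(r'(?<=<h3 class="heading">).*?(?=</h3>)')

m = lookaround.search(line)
value = m.group(0) if m else None
print(value)
```

With lookarounds there is no capturing group to pick, so `group(0)` already holds the wanted value. Python requires the lookbehind to have a fixed width, which holds here because the opening tag is a literal string.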
danielpopa