How to extract text from an HTML table row

Question

This is my string:

content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'

I tried the regular expression below to extract the text between the h5 tags:

   reg = re.search(r'<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>([A-Za-z0-9%s]+)</h5></span></td></tr>' % string.punctuation,content)

It returns exactly what I want.

Is there a more Pythonic way to do this?

I want a regular expression solution rather than BeautifulSoup or Scrapy. — Veera Balla Deva, Jan 18 '18 at 12:30
Do ***NOT*** use regex for parsing html/xml/tag-style data. See [here](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — James, Jan 18 '18 at 12:33

score 2 · Accepted Answer · answered Jan 18 '18 at 12:34

Dunno whether this qualifies as more pythonic or not, but it handles it as HTML data.

from lxml import html

content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
# Parse the fragment as HTML, then collect every text node in document order.
HtmlData = html.fromstring(content)
ListData = HtmlData.xpath('//text()')

And to get the last element:

ListData[-1]
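
If lxml is not available, a similar lookup can be sketched with the standard library's xml.etree.ElementTree. This is a different technique from the lxml call above, and it assumes the fragment is well-formed XML, which holds for this string but not for HTML in general. The names row, h5 and town are only illustrative.

```python
import xml.etree.ElementTree as ET

content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'

# Assumption: the fragment is well-formed XML. Real-world HTML often is not,
# and ET.fromstring raises ParseError in that case.
row = ET.fromstring(content)

# Find the first <h5> anywhere below the <tr> and read its text.
h5 = row.find('.//h5')
town = h5.text if h5 is not None else None
print(town)
```

Selecting the h5 element by name, rather than taking the last text node, keeps the result stable if more cells are later added to the row.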

Srevilo


To install on a Debian-based system, use the python3-lxml package. – Srevilo Jan 18 '18 at 12:35
