How to get plain text between two regex patterns in Python

Question

I want to extract the text content of <div id="UserInputtedText">......</div>.

Note that in the actual response, <, >, & and " appear as their HTML entity equivalents (&lt;, &gt;, &amp; and &quot;) :(.

Here is the code I wrote:


import re
response='[<td width="100%" class="wrapText device-width" valign="top" style="overflow: hidden; border-collapse: collapse !important; border-spacing: 0 !important; border: none; display: inline-block; max-width:600px;"><h3 style="font-family: Helvetica, Arial, sans-serif; font-weight: normal; line-height: 19px; color: #231f20; text-align: left; font-size: 14px; margin: 0 0 2px; font-weight:none;" align="left"><div id="UserInputtedText">Hi Dear ,<br /><br />we hope you enjoy your shopping with us !<br />please leave us a positive feedback on the feedback section on your purchase history<br />You can click the button next to the item and leave a feedback there we will REALLY appreciate that !<br />Have a Great Day &amp;amp; STAY SAFE !</div></h3>]'

# pattern 1, meant to match the entity-encoded form of: div id="UserInputtedText">
pattern_1 = r'(\w+\s\w+[=][&]\w+[;]\w+[&]\w+[;][&]\w+[;])'

# pattern 2, meant to match the entity-encoded form of: </div></h3>
pattern_2 = r'([&]\w+[;][/]\w+[&]\w+[;][&]\w+[;][/]\w+[&]\w+[;])'

pattern=re.search(r'(\w+\s\w+[=][&]\w+[;]\w+[&]\w+[;][&]\w+[;])(.*)([&]\w+[;][/]\w+[&]\w+[;][&]\w+[;][/]\w+[&]\w+[;])',response)

print(pattern.group(2))

https://stackoverflow.com/a/1732454/548562, use something like BeautifulSoup instead of regex — Iain Shelvington, Aug 23 '20 at 23:17

score 0 · Answer 1 · answered Aug 24 '20 at 00:01

There are two ways to approach this:

1. What you're trying to parse is HTML, which is beyond the power of a regex (see stackoverflow.com/a/1732454/548562). Use one of the HTML-parsing libraries instead, like BeautifulSoup.
2. You don't care about the HTML and you know it'll always have exactly this form, perhaps because it's generated from a template. In that case, you can use a pattern like r'div id="UserInputtedText">(.*)</div></h3>':
```
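import html
import re
# `response` is the string from the question.
m = re.search(r'div id="UserInputtedText">(.*)</div></h3>', response)
if m is None:
    # No match: handle this however suits your program; raising is one option.
    raise ValueError('UserInputtedText block not found')
# Turn <br /> tags into newlines, then decode entities such as &amp;.
text = html.unescape(m.group(1).replace('<br />', '\n'))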
```
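
If the response really does arrive with the markup itself entity-encoded, as the question describes, one possible variant is to decode it once before matching. This is only a minimal sketch: the helper name `extract_user_text` and the shortened `sample` string are made up for illustration, and the non-greedy `(.*?)` with `re.DOTALL` is an addition here so the match stops at the first closing tag and tolerates newlines.

```python
import html
import re

def extract_user_text(response):
    """Return the text of the UserInputtedText div, or None if it is absent."""
    # Decode the entity-encoded markup (&lt;div ...&gt; becomes <div ...>).
    markup = html.unescape(response)
    m = re.search(r'<div id="UserInputtedText">(.*?)</div>', markup, re.DOTALL)
    if m is None:
        return None
    # Turn <br /> into newlines, then decode entities left in the text itself.
    return html.unescape(m.group(1).replace('<br />', '\n'))

# Made-up sample, shortened from the kind of input shown in the question.
sample = ('&lt;h3&gt;&lt;div id=&quot;UserInputtedText&quot;&gt;Hi Dear ,'
          '&lt;br /&gt;Have a Great Day &amp;amp; STAY SAFE !&lt;/div&gt;&lt;/h3&gt;')
print(extract_user_text(sample))
```

Decoding before matching only makes sense when the markup is encoded; on a response that already contains literal tags, skip the first `html.unescape` call.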

In principle, using an HTML parser is the better solution. In practice, when web-scraping, you have no guarantee that the element will keep id="UserInputtedText" (unless you have some sort of agreement with the other side), at which point most of the parser's advantages go away.

If you're going to do a lot of processing on webpages, BeautifulSoup is still an advantage, because it's a lot easier to avoid accidentally matching something other than what you intended. If there's just the one web page, though, the two approaches are about even, both in how easy they are to write and in how likely they are to break.
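
For comparison, the parser-based route can be sketched without third-party dependencies. BeautifulSoup is the library named above; this sketch swaps in the standard library's `html.parser.HTMLParser` instead so that it runs as-is. The class name `DivTextExtractor` and the sample markup are made up for illustration, and the input is assumed to contain literal (already decoded) tags.

```python
from html.parser import HTMLParser

class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> that has a given id."""

    def __init__(self, target_id):
        # convert_charrefs=True makes the parser decode entities in text.
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0       # > 0 while inside the target div (counts nested divs)
        self.done = False    # True once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == 'div':
                self.depth += 1
            elif tag == 'br':
                self.parts.append('\n')
        elif not self.done and tag == 'div' and dict(attrs).get('id') == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

# Made-up sample with literal tags.
sample = ('<h3><div id="UserInputtedText">Hi Dear ,<br />'
          'Have a Great Day &amp; STAY SAFE !</div></h3>')
parser = DivTextExtractor('UserInputtedText')
parser.feed(sample)
parser.close()
text = ''.join(parser.parts)
print(text)
```

Unlike the regex, this does not depend on the div being followed directly by `</h3>`, which is the kind of accidental coupling the parser approach avoids.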
