Python - How to find all substrings with a pattern in HTML?

Question

I am using Python to read HTML data, but I am having difficulty extracting all the substrings that sit between the title tags, as in "d:Title>Good To Great&lt;/d:Title>", from this HTML.

data = """<html><head></head><body><pre style='word-wrap': break-word; white-space: pre-wrap;
d:Title&gt;Good To Great&lt;/d:Title&gt;d:ComplianceAssetId m:null='true'/&gt;
d:Title&gt;War and Peace&lt;/d:Title&gt;/d:ComplianceAssetId m:null='false'/&gt; 
d:Title&gt;The Great Gatsby&lt;/d:Title&gt;/entry&gt;&lt;/feed&gt;</pre></body></html>"""

Expected output:

['Good To Great', 'War and Peace', 'The Great Gatsby']

I suspect regex could be a solution, but my knowledge of regex is limited (I am still learning). Can anyone help me with this problem?

Thanks in advance for your help.

Hi bangbangbangbang, please look into the built-in re package. You can also search for 'Dive Into Python 3', a handy book that covers basic Python 3, including regex handling. – cyneo, Mar 06 '20 at 05:13
*I suspect regex could be a solution* : See [this](https://stackoverflow.com/a/1732454/2928853) if you haven't already. – jrook, Mar 06 '20 at 05:24

alec · Answer 1 · 2020-03-06T05:41:51.087

1

>>> re.findall('Title&gt;(.*)&lt;/d:Title', data)
['Good To Great', 'War and Peace', 'The Great Gatsby']

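You can use the wildcard character `.` to find the text. It matches any character except a newline, and the parentheses form a capture group, so re.findall returns only the text between the tags.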

edited Mar 06 '20 at 05:41

answered Mar 06 '20 at 05:23

alec


Thank you. But my output is ["Good To Great</d:Title>d:ComplianceAssetId m:null='true'/>d:Title>War and Peace</d:Title>/d:ComplianceAssetId m:null='false'/> d:Title>The Great Gatsby"]. Is there any way to keep only the book names in the middle? – Bangbangbang Mar 06 '20 at 05:27
That's strange. The parentheses should filter out the other text from the return value. – alec Mar 06 '20 at 05:29
The `.*` should be `.*?` to make it ungreedy. – Richard van Velzen Mar 06 '20 at 05:39
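
To illustrate the comment above, here is a minimal sketch of the greedy versus non-greedy behaviour. The `sample` string is a hypothetical single-line excerpt modeled on the question's data (the asker's real input is not shown in full), so treat it as an assumption rather than the actual page.

```python
import re

# Hypothetical single-line sample modeled on the question's data (an assumption).
sample = (
    "d:Title&gt;Good To Great&lt;/d:Title&gt;d:ComplianceAssetId m:null='true'/&gt;"
    "d:Title&gt;War and Peace&lt;/d:Title&gt;"
    "d:Title&gt;The Great Gatsby&lt;/d:Title&gt;"
)

# Greedy: .* runs on to the LAST closing tag, so the whole line becomes one match.
greedy = re.findall(r"Title&gt;(.*)&lt;/d:Title", sample)

# Non-greedy: .*? stops at the FIRST closing tag after each opening tag.
lazy = re.findall(r"Title&gt;(.*?)&lt;/d:Title", sample)

print(len(greedy))  # 1
print(lazy)         # ['Good To Great', 'War and Peace', 'The Great Gatsby']
```

Because `.` does not match a newline by default, the greedy pattern can still appear to work when every title sits on its own line, which may explain why the two posters saw different results.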

score 1 · Accepted Answer · answered Mar 06 '20 at 05:41

The regex is `'Title>([\w\s]+)</d:Title'`.

Python version 3.7. I hope this helps.
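
A rough sketch of how this pattern might be applied. It expects literal `>` and `<` characters, so the example assumes the text is first decoded with `html.unescape` from the standard library; that step and the `escaped` sample string are my assumptions, not part of the answer.

```python
import html
import re

# Hypothetical escaped sample modeled on the question's data (an assumption).
escaped = (
    "d:Title&gt;Good To Great&lt;/d:Title&gt;"
    "d:Title&gt;War and Peace&lt;/d:Title&gt;"
)

# Decode &gt; and &lt; so the pattern's literal > and < can match.
text = html.unescape(escaped)

# [\w\s]+ accepts only word characters and whitespace between the tags.
titles = re.findall(r"Title>([\w\s]+)</d:Title", text)
print(titles)  # ['Good To Great', 'War and Peace']

# Caveat: a title containing punctuation is not matched by this character class.
missed = re.findall(r"Title>([\w\s]+)</d:Title", "d:Title>Catch-22</d:Title>")
print(missed)  # []
```

The character class avoids the greedy-match problem because it cannot run across the `<` of a closing tag, at the cost of skipping titles that contain hyphens, apostrophes, or other punctuation.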

answered Mar 06 '20 at 05:41

jose praveen

