I am using Python to read HTML data, but I am having difficulty extracting all the substrings that sit between the title markers, such as "Good To Great" in "d:Title>Good To Great</d:Title>".

data = """<html><head></head><body><pre style='word-wrap': break-word; white-space: pre-wrap;
d:Title&gt;Good To Great&lt;/d:Title&gt;d:ComplianceAssetId m:null='true'/&gt;
d:Title&gt;War and Peace&lt;/d:Title&gt;/d:ComplianceAssetId m:null='false'/&gt;
d:Title&gt;The Great Gatsby&lt;/d:Title&gt;/entry&gt;&lt;/feed&gt;</pre></body></html>"""

Expected output:

['Good To Great', 'War and Peace', 'The Great Gatsby']

I suspect regex could be a solution, but I have limited knowledge of regex (still learning). Can anyone help me with this problem?

Thanks in advance for your help.

Logica
Bangbangbang
  • Hi Bangbangbang, please look into the built-in `re` package for details. You can also search for 'Dive into Python 3', a really handy book that covers basic Python 3, including regex handling. – cyneo Mar 06 '20 at 05:13
  • *I suspect regex could be a solution* : See [this](https://stackoverflow.com/a/1732454/2928853) if you haven't already. – jrook Mar 06 '20 at 05:24

2 Answers

>>> import re
>>> re.findall('Title&gt;(.*)&lt;/d:Title', data)
['Good To Great', 'War and Peace', 'The Great Gatsby']

You can use the wildcard character `.` to match the text between the markers.
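As a minimal sketch of why the parentheses matter: when a pattern has exactly one capture group, `re.findall` returns only the text captured by that group. The `sample` string is a shortened, hypothetical stand-in for the question's data, assuming each title sits on its own line.

```python
import re

# Hypothetical, shortened stand-in for the question's data (one title per line).
sample = (
    "d:Title&gt;Good To Great&lt;/d:Title&gt;\n"
    "d:Title&gt;War and Peace&lt;/d:Title&gt;\n"
)

# No capture group: findall returns the whole match, markers included.
whole = re.findall(r"Title&gt;.*&lt;/d:Title", sample)

# One capture group: findall returns only the text inside the parentheses.
titles = re.findall(r"Title&gt;(.*)&lt;/d:Title", sample)

print(whole)
print(titles)
```

Note that this relies on `.` not matching a newline by default, so each match stays on its own line. If the records are all on one line, the greedy `.*` behaves differently.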

alec
  • Thank you. But my output is ["Good To Great</d:Title>d:ComplianceAssetId m:null='true'/>d:Title>War and Peace</d:Title>/d:ComplianceAssetId m:null='false'/> d:Title>The Great Gatsby"]. Is there any way to keep only the book names in the middle? – Bangbangbang Mar 06 '20 at 05:27
  • That's strange. The parentheses should filter out the other text from the return value. – alec Mar 06 '20 at 05:29
  • The `.*` should be `.*?` to make it ungreedy. – Richard van Velzen Mar 06 '20 at 05:39
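To illustrate the point about greediness, here is a sketch comparing `.*` with `.*?` when all records sit on a single line. That layout is an assumption about the asker's real data, and `one_line` is a hypothetical, shortened string.

```python
import re

# Hypothetical single-line data: two records with no newline between them.
one_line = (
    "d:Title&gt;Good To Great&lt;/d:Title&gt;"
    "d:Title&gt;War and Peace&lt;/d:Title&gt;"
)

# Greedy: .* runs to the LAST closing marker, so everything becomes one match.
greedy = re.findall(r"Title&gt;(.*)&lt;/d:Title", one_line)

# Non-greedy: .*? stops at the FIRST closing marker after each opening one.
lazy = re.findall(r"Title&gt;(.*?)&lt;/d:Title", one_line)

print(greedy)
print(lazy)
```

The greedy version returns a single long string that swallows the markers in between, which resembles the output reported in the comments, while the non-greedy version returns one entry per title.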

The regex is `r'Title&gt;([\w\s]+)&lt;/d:Title'`, which captures only word characters and whitespace between the markers.

[Image: solution output]

Python version 3.7. I hope this helps.
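A sketch of how the character-class pattern behaves, under the assumption that the data is on one line: `[\w\s]+` cannot run past the `&lt;` of the closing marker, so it needs no non-greedy modifier. The trade-off is that a title containing punctuation is skipped. `sample` is hypothetical, and its third title is invented purely to show that limit.

```python
import re

# Hypothetical single-line data; "Catch-22" is an invented extra record
# that contains a character outside [\w\s].
sample = (
    "d:Title&gt;Good To Great&lt;/d:Title&gt;"
    "d:Title&gt;War and Peace&lt;/d:Title&gt;"
    "d:Title&gt;Catch-22&lt;/d:Title&gt;"
)

# Character class: only letters, digits, underscores and whitespace are allowed.
by_class = re.findall(r"Title&gt;([\w\s]+)&lt;/d:Title", sample)

# Non-greedy wildcard: any characters up to the first closing marker.
lazy = re.findall(r"Title&gt;(.*?)&lt;/d:Title", sample)

print(by_class)
print(lazy)
```

For the three titles in the question both patterns give the same result; they differ only when a title includes characters such as hyphens or apostrophes.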

jose praveen