I'm given an HTML file that lists some country names, each with two players from that country. I have to read the file and print each country and its players in a specific format using regular expressions.
The HTML code is given below:
<ul>
<li>
Australia
<ol>
<li> Steven smith </li>
<li> David Warner </li>
</ol>
</li>
<li>
Bangladesh
<ol>
<li> Mashrafe Mortaza </li>
<li> Tamim Iqbal </li>
</ol>
</li>
<li>
England
<ol>
<li> Eoin Morgan </li>
<li> Jos Buttler </li>
</ol>
</li>
</ul>
I have to show it in this format:
Australia - Steven Smith, David Warner
Bangladesh - Mashrafe Mortaza, Tamim Iqbal
England - Eoin Morgan, Jos Buttler
I've tried a few things but haven't got very far. This is what I've come up with so far:
>>> import re
>>> with open("test.html", "r") as f:
...     text = f.read()
...
>>> pq = re.findall(r'^<li>\n(.+?)\n\t<ol>\n\t<li>(.+?)</li>\n\t<li>(.+?)</li>$', text, re.M)
The output looks like this:
[('Australia', ' Steven smith ', ' David Warner '),
('Bangladesh', ' Mashrafe Mortaza ', ' Tamim Iqbal '),
('England', ' Eoin Morgan ', ' Jos Buttler ')]
This is not what I want. The country names seem fine, but the player names still contain the surrounding whitespace, and the result is a list of tuples rather than the formatted lines. I'm new to regular expressions and not entirely sure what to do here. Any help would be appreciated.
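For reference, here is a minimal sketch of one direction that might work. It assumes every country has exactly two players and that each name sits on a single line, as in the sample above. The markup is embedded as a string so the snippet runs on its own, and the names `pattern` and `lines` are only illustrative. The idea is to put `\s*` outside each capture group so the whitespace is consumed by the pattern rather than captured.

```python
import re

# Sample markup copied from the question; in practice this would be the
# contents of test.html.
text = """<ul>
<li>
Australia
<ol>
<li> Steven smith </li>
<li> David Warner </li>
</ol>
</li>
<li>
Bangladesh
<ol>
<li> Mashrafe Mortaza </li>
<li> Tamim Iqbal </li>
</ol>
</li>
<li>
England
<ol>
<li> Eoin Morgan </li>
<li> Jos Buttler </li>
</ol>
</li>
</ul>"""

# \s* matches spaces, tabs and newlines, so the layout of the file no longer
# matters and the lazy groups capture only the names themselves.
pattern = re.compile(
    r"<li>\s*([^<]+?)\s*<ol>\s*<li>\s*(.+?)\s*</li>\s*<li>\s*(.+?)\s*</li>"
)

# Build one "Country - Player, Player" line per match.
lines = [
    f"{country} - {first}, {second}"
    for country, first, second in pattern.findall(text)
]
print("\n".join(lines))
```

Note that this keeps "Steven smith" exactly as it is spelled in the HTML; capitalising it would be a separate step. Alternatively, calling `.strip()` on each item of the tuples from the original `findall` should remove the whitespace without changing the pattern.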