Query about regular expression

Question

I've been given an HTML file that lists some countries, each with two players from that country. I have to read the file and print the country and player names in a specific format using regular expressions.

The HTML code is given below:

<ul>
<li>
Australia
    <ol>
    <li> Steven smith </li>
    <li> David Warner </li>
    </ol>
</li>
<li>
Bangladesh
    <ol>
    <li> Mashrafe Mortaza </li>
    <li> Tamim Iqbal  </li>
    </ol>
</li>
<li>
England
    <ol>
    <li> Eoin Morgan </li>
    <li> Jos Buttler </li>
    </ol>
</li>
</ul>

I have to show it in this format:

Australia - Steven Smith, David Warner
Bangladesh - Mashrafe Mortaza, Tamim Iqbal
England - Eoin Morgan, Jos Buttler

I've tried something but haven't got very far. This is what I've come up with so far:

import re

with open("test.html", "r") as f:
    text = f.read()

pq = re.findall(r'^<li>\n(.+?)\n\t<ol>\n\t<li>(.+?)</li>\n\t<li>(.+?)</li>$', text, re.M)

The output looks like this:

[('Australia', ' Steven smith ', ' David Warner '),
('Bangladesh', ' Mashrafe Mortaza ', ' Tamim Iqbal  '),
('England', ' Eoin Morgan ', ' Jos Buttler ')]

This is not what I wanted. The country names seem fine, but the player names still contain the surrounding whitespace. I'm new to regular expressions and not entirely sure what to do here. Any help would be appreciated.

don't use regexes to parse XML/HTML. use a proper parser like `lxml` — Jean-François Fabre, Jan 10 '18 at 20:10
[H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — ctwheels, Jan 10 '18 at 20:11

score 2 · Accepted Answer · answered Jan 10 '18 at 20:28

You can use a combination of a parser and a regular expression:

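from bs4 import BeautifulSoup
import re

# Country name first, then one player per line of the item's text.
rx = re.compile(r'''
    ^
    (?P<country>\w+)\s*
    (?P<player1>.+)[\n\r]
    (?P<player2>.+)''', re.MULTILINE | re.VERBOSE)

# your_string_here is the HTML text, e.g. the contents of test.html.
soup = BeautifulSoup(your_string_here, 'lxml')

# 'ul > li' selects only the top-level (country) items; strip() trims the padding.
players = ["{} - {}, {}".format(m.group('country'), m.group('player1').strip(), m.group('player2').strip())
            for item in soup.select('ul > li')
            for m in rx.finditer(item.text)]
print(players)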

Which yields

['Australia - Steven smith, David Warner', 'Bangladesh - Mashrafe Mortaza, Tamim Iqbal', 'England - Eoin Morgan, Jos Buttler']
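
If the exercise really does require a regex-only solution, the pattern from the question can be adapted so that the whitespace is matched outside the capture groups instead of inside them. The following is only a sketch: it assumes every country has exactly two players and that the markup is as regular as the sample, and the names `html`, `pattern` and `lines` are illustrative rather than taken from the question.

```python
import re

# Sample markup copied from the question; in practice this would be read from test.html.
html = """<ul>
<li>
Australia
    <ol>
    <li> Steven smith </li>
    <li> David Warner </li>
    </ol>
</li>
<li>
Bangladesh
    <ol>
    <li> Mashrafe Mortaza </li>
    <li> Tamim Iqbal  </li>
    </ol>
</li>
<li>
England
    <ol>
    <li> Eoin Morgan </li>
    <li> Jos Buttler </li>
    </ol>
</li>
</ul>"""

# \s* sits outside every group, so newlines, tabs and spaces are consumed
# by the pattern but never end up in the captured names.
pattern = re.compile(
    r"<li>\s*([^<]+?)\s*<ol>"      # country name, up to the nested list
    r"\s*<li>\s*(.+?)\s*</li>"     # first player
    r"\s*<li>\s*(.+?)\s*</li>"     # second player
)

lines = ["{} - {}, {}".format(*m.groups()) for m in pattern.finditer(html)]
print("\n".join(lines))
```

Because `\s*` matches tabs as well as spaces, this does not depend on how the file is indented. It keeps the names exactly as written in the file, so "Steven smith" stays lowercase. It will still break on markup that deviates from the sample (a third player, attributes on the tags), which is why the parser-based approach is the safer one.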
