Better way to extract a country-wise player list from an HTML file using regular expressions

Question

Problem statement:

Make a country-wise player list from the following HTML:

<ul>
    <li>
        Australia
        <ol>
            <li>Steven Smith</li>
            <li>David Warner</li>
        </ol>
    </li>
    <li>
        Bangladesh
        <ol>
            <li>Mashrafe Mortaza</li>
            <li>Tamim Iqbal</li>
        </ol>
    </li>
    <li>
        England
        <ol>
            <li>Eoin Morgan</li>
            <li>Jos Buttler</li>
        </ol>
    </li>
</ul>

Expected Output:

Australia- Steven Smith, David Warner

Bangladesh- Mashrafe Mortaza, Tamim Iqbal

England- Eoin Morgan, Jos Buttler

My Code:

It works well. I'm looking for better code. Please help me.

import re

with open('playerlist.html', 'r') as f:
    text = f.read()

# Drop newlines and tabs so that the tags become adjacent.
mytext = re.sub(r'[\n\t]', '', text)

# Groups: country, first player, second player.
cpat = re.compile(r'<li>(\w+?)<ol><li>(\w+\s?\w+)</li><li>(\w+\s?\w+)</li>')

for country, first, second in cpat.findall(mytext):
    print('{0}- {1}, {2}'.format(country, first, second))

What are you looking to improve on in the existing algorithm? — Holland Wilson, Oct 27 '17 at 06:09
If it really works and there is no issue, please consider posting it at [codereview.se], it is off-topic here. — Wiktor Stribiżew, Oct 27 '17 at 06:46
yes, there is a big problem in your code: you should NOT parse xml/html with regex — RomanPerekhrest, Oct 27 '17 at 06:59

score 0 · Answer 1 · answered Oct 27 '17 at 07:10

Parsing xml/html data with regex never was and never will be a good idea.
Use xml/html parsers.

The right way, using the xml.etree.ElementTree module (one such parser; you could try others). Note that it requires the file to be well-formed XML, as the sample is:

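import xml.etree.ElementTree as ET

root = ET.parse('playerlist.html').getroot()
# './/li[ol]' selects every <li> that has an <ol> child, i.e. the country items.
for li in root.findall('.//li[ol]'):
    # join() copes with any number of players per country.
    players = ', '.join(i.text.strip() for i in li.findall('ol/li'))
    print(li.text.strip(), '- {}'.format(players))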

The output:

Australia - Steven Smith, David Warner
Bangladesh - Mashrafe Mortaza, Tamim Iqbal
England - Eoin Morgan, Jos Buttler
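
If the real file is not well-formed XML, the standard library's html.parser is one of the other parsers that could be tried. The following is only a minimal sketch, assuming the same ul/li/ol layout as in the question; the class name PlayerListParser, its teams attribute, and the inline sample string are made up for illustration:

```python
from html.parser import HTMLParser


class PlayerListParser(HTMLParser):
    """Hypothetical helper: collects {country: [players]} from the ul/li/ol layout."""

    def __init__(self):
        super().__init__()
        self.teams = {}       # country -> list of player names, in document order
        self.in_ol = False    # True while inside a nested <ol>
        self.country = None   # most recently seen country name

    def handle_starttag(self, tag, attrs):
        if tag == 'ol':
            self.in_ol = True

    def handle_endtag(self, tag):
        if tag == 'ol':
            self.in_ol = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # ignore whitespace between tags
        if self.in_ol:
            # Text inside <ol> is a player of the current country.
            if self.country is not None:
                self.teams[self.country].append(text)
        else:
            # Text outside <ol> is assumed to be a country name.
            self.country = text
            self.teams[text] = []


# Inline copy of the question's markup (assumption: the file has this shape).
sample = """
<ul>
    <li>
        Australia
        <ol>
            <li>Steven Smith</li>
            <li>David Warner</li>
        </ol>
    </li>
    <li>
        Bangladesh
        <ol>
            <li>Mashrafe Mortaza</li>
            <li>Tamim Iqbal</li>
        </ol>
    </li>
    <li>
        England
        <ol>
            <li>Eoin Morgan</li>
            <li>Jos Buttler</li>
        </ol>
    </li>
</ul>
"""

parser = PlayerListParser()
parser.feed(sample)
parser.close()

for country, players in parser.teams.items():
    print('{}- {}'.format(country, ', '.join(players)))
```

Because the players are gathered into a list, this variant does not depend on each country having exactly two players.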

score 0 · Answer 2 · answered Jan 23 '20 at 12:37

After replacing newlines and tabs with the empty string, my regex pattern looks like this; the \s* parts allow for any whitespace left between the tags.

r'<li>\s*(\w+?)\s*<ol>\s*<li>\s*(\w+\s?\w+)\s*</li>\s*<li>\s*(\w+\s?\w+)\s*</li>'
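
The pattern above still expects exactly two players per country and names of at most two words. If regex has to be used anyway, a two-step variant is one possible way around that. This is only a sketch, assuming the markup always follows the layout in the question; block_pat, player_pat and the inline html string are illustrative names, not part of the original code:

```python
import re

# Inline copy of the question's markup (assumption: the file has this shape).
html = """
<ul>
    <li>
        Australia
        <ol>
            <li>Steven Smith</li>
            <li>David Warner</li>
        </ol>
    </li>
    <li>
        Bangladesh
        <ol>
            <li>Mashrafe Mortaza</li>
            <li>Tamim Iqbal</li>
        </ol>
    </li>
    <li>
        England
        <ol>
            <li>Eoin Morgan</li>
            <li>Jos Buttler</li>
        </ol>
    </li>
</ul>
"""

# Step 1: capture the country name and the raw contents of its <ol> block.
# re.S lets '.' match newlines, so no prior substitution is needed.
block_pat = re.compile(r'<li>\s*([^<]+?)\s*<ol>(.*?)</ol>', re.S)

# Step 2: capture every player inside one block, however many there are.
player_pat = re.compile(r'<li>\s*([^<]+?)\s*</li>')

lines = []
for country, block in block_pat.findall(html):
    players = player_pat.findall(block)
    lines.append('{}- {}'.format(country, ', '.join(players)))

print('\n'.join(lines))
```

This remains fragile against attributes, comments, or deeper nesting in the markup, which is why a parser is usually the safer choice.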

Akil Mahmod Tipu