
I have an HTML file that I read with Python, and I would like to customize how its contents are printed.

I need to print each country name first, followed by the names of the players who belong to that country.

My HTML file looks like this:

<ul>
<li>
    Australia
    <ol>
        <li>Steve Smith</li>
        <li>David Warner</li>
        <li>Aaron Finch</li>
    </ol>
</li>

<li>
    Bangladesh
    <ol>
        <li>Shakib Al Hasan</li>
        <li>Tamim Iqbal</li>
        <li>Mushfiqur Rahim</li>
    </ol>
</li>


<li>
    England
    <ol>
        <li>Ben Stokes</li>
        <li>Joe Root</li>
        <li>Eoin Morgan</li>
    </ol>
</li>
</ul>

I want to extract the following output from my HTML file:

Australia - Steve Smith, David Warner, Aaron Finch
Bangladesh - Shakib Al Hasan, Tamim Iqbal, Mushfiqur Rahim
England - Ben Stokes, Joe Root, Eoin Morgan

But I can only extract the players' names. This is my code:

import re

file_name = "team.html"
mode = "r"    

with open(file_name, mode) as fp:
    team = fp.read()

pat =  re.compile(r'<li>(.*?)</li>')
result = pat.findall(team)
res = ", ".join([str(player) for player in result])
print(res)

Also, I don't use any packages such as bs4. I would like to solve this using regex.

mkrieger1
    Use an HTML parser. – eumiro May 03 '20 at 20:15
    Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – mkrieger1 May 03 '20 at 20:17

3 Answers


Here is a solution using regex.

import re

file_name = "team.html"
mode = "r"    

with open(file_name, mode) as fp:
    team = fp.read()

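# First alternative: a country <li> (the tag is followed by whitespace, then the name).
# Second alternative: a player <li> (the name is enclosed directly by the tags).
# [A-Za-z ] is used instead of [A-z ], which would also match [ \ ] ^ _ and `.
regex = re.compile(r'<li>\s+(?P<country>[A-Za-z ]+)|<li>(?P<name>[A-Za-z ]+)</li>')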

country_team_rel = {}
country = None
for result in regex.findall(team):
    if result[0]:
        country = result[0]
        country_team_rel[country] = []
    else:
        country_team_rel[country].append(result[1])

# Or, if you would rather print the result directly
buffer = []
for result in regex.findall(team):
    if result[0]:
        if buffer:
            print(", ".join(buffer))
            buffer = []
        print(result[0] + " - ", end='')
    else:
        buffer.append(result[1])
print(", ".join(buffer))
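
The single-pattern approach above relies on whitespace after `<li>` to tell countries from players. A variant that may be easier to adapt is a two-pass match: first grab each country block, then pull the players out of that block's `<ol>`. This is only a sketch, assuming every country `<li>` contains exactly one `<ol>` and that lists are not nested any deeper. The names `SAMPLE_HTML`, `BLOCK_RE`, `PLAYER_RE` and `format_teams` are illustrative and not part of the original code.

```python
import re

# Sample markup copied from the question, inlined so the sketch runs on its own
SAMPLE_HTML = """
<ul>
<li>
    Australia
    <ol>
        <li>Steve Smith</li>
        <li>David Warner</li>
        <li>Aaron Finch</li>
    </ol>
</li>
<li>
    Bangladesh
    <ol>
        <li>Shakib Al Hasan</li>
        <li>Tamim Iqbal</li>
        <li>Mushfiqur Rahim</li>
    </ol>
</li>
</ul>
"""

# Pass 1: one match per country block (country name, then the raw <ol> body)
BLOCK_RE = re.compile(r'<li>\s*([^<]+?)\s*<ol>(.*?)</ol>', re.DOTALL)
# Pass 2: every player inside a single <ol> body
PLAYER_RE = re.compile(r'<li>\s*(.*?)\s*</li>', re.DOTALL)


def format_teams(html):
    """Return one 'Country - player, player, ...' line per country block."""
    lines = []
    for country, ol_body in BLOCK_RE.findall(html):
        players = PLAYER_RE.findall(ol_body)
        lines.append(f"{country} - {', '.join(players)}")
    return lines


for line in format_teams(SAMPLE_HTML):
    print(line)
```

Because the player pattern accepts any text between the tags, names containing hyphens or apostrophes are also captured, which the letter-only character class would miss.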
ErdoganOnal

As already suggested, BeautifulSoup is the right tool for this task:

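import bs4

file_name = "team.html"
mode = "r"
with open(file_name, mode) as fp:
    team = fp.read()

# Name the parser explicitly; otherwise bs4 guesses one and emits a warning
soup = bs4.BeautifulSoup(team, "html.parser")
country = None
players = []
for i in soup.find_all('li'):
    if '\n' in i.text:
        # An outer <li> spans several lines: it starts a new country
        if country:
            print(country, '-', ', '.join(players))
        country = i.text.splitlines()[1].strip()
        players = []
    else:
        # An inner <li> holds a single player name
        players.append(i.text)
print(country, '-', ', '.join(players))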
Błotosmętek

It could be a mistake to use regex in this case (I am not 100% sure). You should use Beautiful Soup, or another HTML parser.
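
For instance, Python's standard library includes `html.parser.HTMLParser`, so an HTML parser can be used without installing bs4. The following is a minimal sketch, assuming the structure shown in the question (country text directly inside an outer `<li>`, players inside a nested `<ol>`). The names `TeamParser`, `parse_teams` and `SAMPLE_HTML` are made up for illustration.

```python
from html.parser import HTMLParser

# Shortened sample in the same shape as the question's file
SAMPLE_HTML = """
<ul>
<li>
    Australia
    <ol>
        <li>Steve Smith</li>
        <li>David Warner</li>
    </ol>
</li>
<li>
    England
    <ol>
        <li>Ben Stokes</li>
        <li>Joe Root</li>
    </ol>
</li>
</ul>
"""


class TeamParser(HTMLParser):
    """Collect {country: [players]} from a <ul> of countries with nested <ol>s."""

    def __init__(self):
        super().__init__()
        self.teams = {}       # country -> player names, in document order
        self.country = None   # country whose block is currently being read
        self.ol_depth = 0     # greater than 0 while inside a player list
        self.in_li = False    # True between an <li> tag and the next tag

    def handle_starttag(self, tag, attrs):
        if tag == "ol":
            self.ol_depth += 1
        self.in_li = (tag == "li")

    def handle_endtag(self, tag):
        if tag == "ol":
            self.ol_depth -= 1
        self.in_li = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.in_li:
            return
        if self.ol_depth == 0:
            # Text directly after an outer <li> is a country name
            self.country = text
            self.teams[text] = []
        elif self.country is not None:
            # Text in an <li> nested inside <ol> is a player name
            self.teams[self.country].append(text)


def parse_teams(html):
    """Parse the markup and return the country -> players mapping."""
    parser = TeamParser()
    parser.feed(html)
    parser.close()
    return parser.teams


for country, players in parse_teams(SAMPLE_HTML).items():
    print(f"{country} - {', '.join(players)}")
```

Tracking the `<ol>` depth, rather than looking for newlines in the text, means the result does not depend on how the file happens to be indented.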