I am trying to parse an HTML page using regular expressions. I need to find the sum of all the comment counts on this web page: https://py4e-data.dr-chuck.net/comments_42.html Everything else works, but re.findall picks up only the second digit of each two-digit number, and I cannot figure out why this is happening.
This is my code:
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
code = list()
html = urllib.request.urlopen("https://py4e-data.dr-chuck.net/comments_42.html", context=ctx)
for line in html:
    line = line.decode()
    line = line.strip()
    numbers = re.findall("<span.+([0-9]+)", line)
    if len(numbers) != 1: continue
    print(numbers)
This is my output: I am getting 7 instead of 97, and 0 instead of 90.
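A likely explanation, sketched offline without fetching the page: the `.+` in the pattern is greedy, so it consumes as much of the line as it can and backtracks only far enough to leave a single digit for `([0-9]+)`. The sample lines below are an assumption about what a row of that page looks like (the names and the exact markup are illustrative, not copied from the site); the two alternative patterns are possible fixes, not the only ones.

```python
import re

# Assumed shape of the rows on the page (illustrative, not fetched).
sample_lines = [
    '<tr><td>Romina</td><td><span class="comments">97</span></td></tr>',
    '<tr><td>Laurie</td><td><span class="comments">90</span></td></tr>',
]

greedy, lazy, explicit = [], [], []
for line in sample_lines:
    # Greedy ".+" runs to the end of the line, then backtracks just
    # enough for "[0-9]+" to match, which leaves only the last digit.
    greedy += re.findall(r"<span.+([0-9]+)", line)
    # Lazy ".+?" stops at the first position where digits can match.
    lazy += re.findall(r"<span.+?([0-9]+)", line)
    # Matching up to the end of the opening tag avoids ".+" entirely.
    explicit += re.findall(r"<span[^>]*>([0-9]+)", line)

print(greedy)    # ['7', '0']
print(lazy)      # ['97', '90']
print(explicit)  # ['97', '90']

# Summing the captured strings once they are converted to integers.
total = sum(int(n) for n in explicit)
print(total)     # 187
```

If the real rows look like the assumed ones, swapping the pattern in the original loop for either the lazy or the explicit version should yield the full numbers.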