RegEx for capturing an element textContent

Question

I'm trying to grab the titles of events from a website. I get most of them, but my regex won't pick up one title. The missing result is:

AFL U16’s Championships

Can someone tell me what I need to change in my Regex to find this?

from re import *
from urllib.request import urlopen

Website = 'https://thegabba.com.au/what-s-on.aspx'
print('Now Gathering Results from URL: ' + Website)

html_source = urlopen(Website).read().decode("UTF-8")
EventMatches = findall('<h6 class="event-title">([A-Za-z0-9\'\\s]+)</h6>',html_source)

print('There are ' + str(len(EventMatches)) + ' Events.')

for EventNames in EventMatches:
    print(EventNames)

@Emma The output looks like this https://imgur.com/a/t8u6L1r and the desired output is the same, except with an additional result, which is the AFL U16’s Championships — ZooKeeper, May 25 '19 at 02:46
Anyone trying to parse HTML with a regexp should be aware: [the pony, he comes](https://stackoverflow.com/a/1732454/238884). — Michael Lorton, May 25 '19 at 02:49
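For anyone who wants to follow that advice without a third-party library, here is a minimal sketch using the standard library's `html.parser`. The class name `EventTitleParser` and the sample markup are made up for illustration; the real page may be structured differently.

```python
from html.parser import HTMLParser


class EventTitleParser(HTMLParser):
    """Collects the text of every <h6 class="event-title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h6" and "event-title" in classes:
            self._in_title = True
            self._chunks = []

    def handle_data(self, data):
        # Text is kept only while inside a matching h6 element.
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "h6" and self._in_title:
            self.titles.append("".join(self._chunks).strip())
            self._in_title = False


# Hypothetical sample markup, not copied from the real site.
sample_html = (
    '<h6 class="event-title">Stadium Stomp</h6>\n'
    '<h6 class="event-title">AFL U16\u2019s Championships</h6>\n'
    '<h6 class="other">Not an event</h6>'
)

parser = EventTitleParser()
parser.feed(sample_html)
parser.close()
print(parser.titles)
```

Because the parser never inspects which characters the title contains, the curly apostrophe needs no special handling here.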

score 2 · Answer 1 · answered May 25 '19 at 02:36

The typographic (curly) apostrophe ’ (U+2019) is not the same character as the ASCII single quote ' (U+0027). Your character class only allows the latter, so you need to allow the former as well if you want that result included.

paxdiablo


So how do I include a single quote? – ZooKeeper May 25 '19 at 02:42
@ZooKeeper, you *have* the single quote; what you need is the apostrophe. Just add it to the `[A-Za-z0-9\'\\s]` character class using whatever encoding you have (it's not ASCII; I'm assuming it's Unicode since you mention `UTF-8`). – paxdiablo May 25 '19 at 02:47
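A minimal sketch of that suggestion, assuming the missing title uses U+2019 as it appears in the question. The sample markup is made up for illustration and is not copied from the real page.

```python
import re
import unicodedata

# Hypothetical sample markup; the real page content may differ.
sample_html = (
    '<h6 class="event-title">Stadium Stomp</h6>'
    '<h6 class="event-title">AFL U16\u2019s Championships</h6>'
)

# The two characters look alike but have different code points.
ascii_quote = "'"        # U+0027
curly_quote = "\u2019"   # U+2019
quote_names = [unicodedata.name(ascii_quote), unicodedata.name(curly_quote)]

# Original character class: ASCII letters, digits, ASCII quote, whitespace.
original_pattern = r'<h6 class="event-title">([A-Za-z0-9\'\s]+)</h6>'
# Same class with U+2019 added (the re module understands \uXXXX escapes).
extended_pattern = r'<h6 class="event-title">([A-Za-z0-9\'\u2019\s]+)</h6>'

original_titles = re.findall(original_pattern, sample_html)
extended_titles = re.findall(extended_pattern, sample_html)

print(quote_names)
print(original_titles)   # misses the title containing U+2019
print(extended_titles)   # includes both titles
```

Note that Unicode itself names U+0027 "APOSTROPHE" and U+2019 "RIGHT SINGLE QUOTATION MARK", so the everyday names are easy to mix up; the code points are what matter.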

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

The expression we might want here would be:

<h6 class="event-title">(.+?)<\/h6>

which lazily captures everything between the opening and closing h6 tags, whatever characters the title contains.

Test

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"<h6 class=\"event-title\">(.+?)<\/h6>"

test_str = "<h6 class=\"event-title\">Brisbane Lions v Hawthorn Football Club and Anything else we wish here including @#$%^&*(</h6>"

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):
    
    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        
        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
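To illustrate why the quantifier in `(.+?)` is lazy, here is a small sketch comparing it with the greedy `(.+)`, plus the `re.DOTALL` flag for a title that spans a line break. The sample strings are made up for illustration.

```python
import re

# Hypothetical markup with two titles on one line.
one_line_html = (
    '<h6 class="event-title">Stadium Stomp</h6>'
    '<h6 class="event-title">Muscle Up For MND</h6>'
)

# Lazy: stops at the first closing tag, so each title is captured separately.
lazy_titles = re.findall(r'<h6 class="event-title">(.+?)</h6>', one_line_html)
# Greedy: runs to the last closing tag and swallows the markup in between.
greedy_titles = re.findall(r'<h6 class="event-title">(.+)</h6>', one_line_html)

# Hypothetical title that contains a line break.
multi_line_html = '<h6 class="event-title">Australia v\nSri Lanka</h6>'

# By default "." does not match a newline, so this finds nothing.
without_dotall = re.findall(r'<h6 class="event-title">(.+?)</h6>', multi_line_html)
# With re.DOTALL, "." also matches the newline.
with_dotall = re.findall(
    r'<h6 class="event-title">(.+?)</h6>', multi_line_html, re.DOTALL
)

print(lazy_titles)
print(greedy_titles)
print(without_dotall)
print(with_dotall)
```

`re.MULTILINE` only changes how `^` and `$` behave, so it has no effect on this pattern; `re.DOTALL` is the flag that matters if a title can wrap onto a second line.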

score 0 · Answer 3 · answered May 25 '19 at 02:52

The response content was actually returned as bytes, not UTF-8/ASCII text, so I decoded it as ISO-8859-1:

#!/usr/bin/python3
import re
import requests

Website = 'https://thegabba.com.au/what-s-on.aspx'
print('Now Gathering Results from URL: {}'.format(Website))

html_source = requests.get(Website).content.decode('ISO-8859-1') 
EventMatches = re.findall(r'<h6 class="event-title">([A-Za-z0-9\'\s]+)<\/h6>', html_source)

print('There are {} Events.'.format(len(EventMatches)))

for EventNames in EventMatches:
    print(EventNames)

Now Gathering Results from URL: https://thegabba.com.au/what-s-on.aspx
There are 14 Events.
Brisbane Lions v Hawthorn Football Club
Brisbane Lions v Melbourne Football Club
Brisbane Lions v North Melbourne Football Club
Stadium Stomp
Brisbane Lions v Western Bulldogs
Brisbane Lions v Gold Coast Suns
Muscle Up For MND
Brisbane Lions v Geelong Football Club
Australia v Sri Lanka
Australia v Pakistan
Pakistan v New Zealand
Australia v A1
New Zealand v A1
England v Afghanistan
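The list above has 14 events and still lacks the AFL U16’s title. One plausible reason, assuming the page is served as UTF-8 (as the `decode("UTF-8")` call in the question suggests), is that decoding UTF-8 bytes as ISO-8859-1 turns the single character U+2019 into three unrelated characters, none of which the character class allows. A small sketch of that effect, using the title text from the question:

```python
import re

title = "AFL U16\u2019s Championships"
utf8_bytes = title.encode("utf-8")  # U+2019 becomes the bytes E2 80 99

# Decoding those bytes with the wrong codec yields three characters for one.
decoded_latin1 = utf8_bytes.decode("iso-8859-1")
# Decoding with the matching codec restores the original text.
decoded_utf8 = utf8_bytes.decode("utf-8")

# Character class from the question, extended with U+2019.
pattern = r"[A-Za-z0-9'\u2019\s]+"
latin1_ok = re.fullmatch(pattern, decoded_latin1) is not None
utf8_ok = re.fullmatch(pattern, decoded_utf8) is not None

print(repr(decoded_latin1))
print(latin1_ok)
print(utf8_ok)
```

If that assumption holds, decoding as UTF-8 and widening the character class (or capturing with `(.+?)`) would be the combination that picks up the missing title.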
