RegEx for Capturing HTML text with Python

Question

I'm trying to grab paragraphs of text off a website with RegEx to put into a Python list, but for this particular website I'm having difficulty with formatting the RegEx to capture all the events. Can anyone help with gathering results from all instances? Or at least tell me if it's not practical and I'll find an alternate website.

from re import *
from urllib.request import urlopen

## Create Empty List
EventInfoListBEC = []

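## Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

## Download the Page Source (assumes the page is UTF-8 encoded)
PageBEC = urlopen(WebsiteBEC).read().decode('utf-8')

## Search for Event Info
EventInfoBEC = findall('<p class="event-description">(.+?)</p>', PageBEC)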

## Add Event Info to Event Info List and Print Details
print('Event Info appears', len(EventInfoBEC), 'times (BEC).')
for EventInfo in EventInfoBEC:
    EventInfoListBEC.append(EventInfo)
print(EventInfoListBEC)

## There are Three Styles of Input from the HTML File
# One
<p class="event-description"><p>This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.</p>

</p>

# Two
<p class="event-description"><p style="text-align: justify; color: rgb(0, 0, 0); font-family: sans-serif; font-size: 12px;">Fresh off the back of winning a Brit Award for &lsquo;British Artist Video of the Year&rsquo; for &lsquo;Woman Like Me&rsquo;, and two Global Awards for &lsquo;Best Group&rsquo; and &lsquo;Best Song&rsquo;; pop superstars Little Mix today announce that five new Australian shows have been added to &#39;LM5 - The Tour&#39; for 2019!</p>

</p>

# Three
<p class="event-description"><p style="font-family: sans-serif; font-size: 12px; color: rgb(0, 0, 0); text-align: center;"><strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>



<p style="font-family: sans-serif; font-size: 12px; color: #000000; text-align: justify;">The world&rsquo;s most beloved movie-musical comes to life on the arena stage&nbsp;like you&rsquo;ve never seen it before! From the producers of GREASE - THE ARENA EXPERIENCE comes this lavish new arena production of THE WIZARD OF OZ.</p>

https://imgur.com/a/J61seeJ Images of the HTML code if this is confusing — ZooKeeper, May 25 '19 at 07:17
@Thefourthbird I can't use any modules that need to be downloaded separate from those that come with Python — ZooKeeper, May 25 '19 at 07:20
Don't use regex to parse HTML. [Here's why](https://stackoverflow.com/a/590789/4934172). Please, save yourself a lot of trouble and [just don't](https://stackoverflow.com/a/1732454/4934172). — 41686d6564 stands w. Palestine, May 25 '19 at 08:13
[Parsing HTML with regex is a hard job](https://stackoverflow.com/a/4234491/372239) — Toto, May 25 '19 at 09:00

score 0 · Answer 1 · answered May 25 '19 at 10:26

As indicated by many, there are better ways than using regex: I like using lxml (lxml.html) but bs4 would do the job as well.
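Since the asker is limited to modules that ship with Python, a parser-based approach can also be sketched with the standard library's html.parser. This is only an illustration: the class name EventDescriptionParser and the sample string are hypothetical, and it assumes the page nests its paragraphs the way the question's excerpts show.

```python
from html.parser import HTMLParser


class EventDescriptionParser(HTMLParser):
    """Collect the text inside every <p class="event-description"> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities are decoded for us
        self.descriptions = []
        self._depth = 0    # open <p> tags since the description started
        self._parts = []   # text fragments of the current description

    def handle_starttag(self, tag, attrs):
        if tag != 'p':
            return
        if self._depth:
            self._depth += 1          # nested <p> inside a description
        elif ('class', 'event-description') in attrs:
            self._depth = 1           # a new description starts here
            self._parts = []

    def handle_endtag(self, tag):
        if tag == 'p' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace and store the finished text
                text = ' '.join(''.join(self._parts).split())
                self.descriptions.append(text)

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


# Hypothetical sample, abridged from the HTML quoted in the question
sample = (
    '<p class="event-description"><p>Welcome to the world of the PBR.</p>\n\n</p>\n'
    '<p class="other">Ignored</p>\n'
    '<p class="event-description"><p style="text-align: center;">'
    '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong></p>\n\n'
    '<p style="text-align: justify;">The world&rsquo;s most beloved movie-musical.</p>\n</p>\n'
)

parser = EventDescriptionParser()
parser.feed(sample)
parser.close()
descriptions = parser.descriptions
print(descriptions)
```

Because the parser tracks nesting depth, every inner paragraph of a description is kept and inline tags such as <strong> drop out without extra work.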

Anyway, here is a solution using the third-party regex module, which, unlike the built-in re module, allows variable-length lookbehinds. The solution relies on the regex

(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)

which matches the content of the paragraph that directly follows each opening event-description tag. The character class [\w\s\#\;\(\)\"\=\:\-\,] contains all the characters used in the style attributes. Finally, the star * after the class allows a paragraph with no attributes to be matched as well.
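For readers restricted to the built-in re module, which rejects variable-length lookbehinds, the same character class can be combined with a capture group instead. The following is a rough sketch rather than a drop-in replacement: PATTERN and sample are illustrative names, and the sample strings are abridged from the question.

```python
import re

# The character class matches optional attributes on the inner <p>;
# the text is captured lazily up to the first closing </p>.
PATTERN = re.compile(
    r'class="event-description"><p[\w\s#;()"=:\-,]*>(.*?)</p>',
    re.DOTALL,
)

# Abridged sample covering a bare <p> and a <p> with a style attribute
sample = (
    '<p class="event-description"><p>Welcome to the world of the PBR.</p>\n\n</p>\n'
    '<p class="event-description">'
    '<p style="text-align: justify; color: rgb(0, 0, 0);">'
    'Little Mix today announce five new shows!</p>\n\n</p>\n'
)

matches = PATTERN.findall(sample)
print(matches)
```

With a single capture group, findall returns only the captured text, so no lookbehind or lookahead is needed.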

import regex     # third-party: pip install regex
import requests  # third-party: pip install requests

# Assign Website to a Variable
WebsiteBEC = 'https://www.brisent.com.au/Event-Calendar'

# Get source code
req = requests.get(WebsiteBEC, timeout=5)
source_code = req.text

# Extract data
EventInfoBEC = regex.findall(r'(?<=class="event-description"><p[\w\s\#\;\(\)\"\=\:\-\,]*>).*(?=</p>)', source_code)
# ['This is a sport where 8 seconds can cost you everything. Welcome to the world of the PBR.',
#  'See fearless Moana with demigod Maui, follow Dory through the Pacific Ocean, join the Toy Story pals on an exciting adventure and discover true love with Elsa and Anna. Buckle in for the emotional rollercoaster of Inside Out and &ldquo;Live Your Story&rdquo; alongside Disney Princesses as they celebrate their favourite Disney memories!',
#  'Fresh off the back of winning a Brit Award for &lsquo;British Artist Video of the Year&rsquo; for &lsquo;Woman Like Me&rsquo;, and two Global Awards for &lsquo;Best Group&rsquo; and &lsquo;Best Song&rsquo;; pop superstars Little Mix today announce that five new Australian shows have been added to &#39;LM5 - The Tour&#39; for 2019!',
#  '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong>',
#  '<strong>THIRD SHOW ANNOUNCED - ON SALE FROM 2PM FRI 1 FEB!</strong>',
#  '<strong>COMING TO AUSTRALIA FOR THE VERY FIRST TIME.&nbsp;</strong>',
#  'WWE LIVE is returning to Australia!&nbsp;Fans will be able to see their favorite WWE Superstars for the first time since last year&rsquo;s incredible Super Show-Down',
#  '<strong>SHAWN MENDES ANNOUNCES RUEL AS SPECIAL GUEST + ADDITIONAL TICKETS AVAILABLE FOR ALL SHOWS!</strong>',
#  'Steve Martin and Martin Short will bring their critically acclaimed comedy tour Now You See Them, Soon You Won&rsquo;t for the first time to Australian audiences in November.&nbsp;',
#  'After an epic and storied 45-year career that launched an era of rock n roll legends, KISS announced that they will launch their final tour ever in 2019, appropriately named END OF THE ROAD.',
#  '<strong>ELTON JOHN ANNOUNCES 3RD BRISBANE SHOW!</strong>']

One still needs to process the result to get rid of the <strong> tags and the HTML entities such as &rsquo; and &nbsp;. Also, the last paragraph in the source code quoted in the question does not directly follow a class="event-description" tag, hence it will not be captured by the regex.
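The clean-up step could look roughly like this, using only the standard library. The helper name clean_fragment is hypothetical, and the tag-stripping pattern is only meant for short inline fragments like the ones in the output above.

```python
import html
import re


def clean_fragment(fragment):
    """Strip inline tags, decode HTML entities, and trim whitespace."""
    without_tags = re.sub(r'<[^>]+>', '', fragment)
    decoded = html.unescape(without_tags)
    # html.unescape turns &nbsp; into U+00A0; map it to a normal space
    return decoded.replace('\xa0', ' ').strip()


# Two fragments taken from the HTML quoted in the question
raw = [
    '<strong>OPENING NIGHT PERFORMANCE ADDED!</strong>',
    'The world&rsquo;s most beloved movie-musical comes to life on the '
    'arena stage&nbsp;like you&rsquo;ve never seen it before!',
]

cleaned = [clean_fragment(item) for item in raw]
print(cleaned)
```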
