-4

Below is my code to web scrape. I am looking to return the result of the regex, but for some reason it is only returning '[ ]'.

Any help would be hugely appreciated.

Thanks

import urllib.request
import re

url = ('https://www.myvue.com/whats-on')
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

def main():
    html_page = urllib.request.urlopen(req).read()
    content=html_page.decode(errors='ignore', encoding='utf-8')
    headings = re.findall('<th scope="col" abbr="(.*?)">', content)
    print(headings)

main()
zwer
  • 24,943
  • 3
  • 48
  • 66
Jdsmith
  • 13
  • 1
  • 1
  • 4
  • Ummm, because that pattern is not found anywhere on that page? You shouldn't be using regex to parse multi-level/hierarchical structures (like HTML) anyway - use something written for that purpose, like [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). – zwer Jul 12 '17 at 15:37
  • [Don't use regex on HTML/XHTML](https://stackoverflow.com/a/1732454/1040092) – Wondercricket Jul 12 '17 at 15:41
  • @zwer I know! But I have been told to! – Jdsmith Jul 12 '17 at 15:49
  • @Jdsmith, send the SO answer that Wondercricket has linked to whoever asked you to commit such a crime. – Tom Wyllie Jul 12 '17 at 16:35
  • @Jdsmith, I know in the past I have been told by managers to try something one way only find out it did not work properly... What I have done in the past is 1) give it the old college try, talk to others too. If it does not work using that technique, 2) don't be afraid to bring this info to your boss. Don't be mad that your boss sent you down a rabbit hole. He did the best he could too by guessing and was trying to be helpful too. Treat the circumstance with intellectual curiosity and honesty. He cannot fault you for trying and you both may learn something. – mccurcio Jul 12 '17 at 16:37
  • @Jdsmith Does the current answer not answer your question? – cs95 Jul 13 '17 at 08:49
  • @COLDSPEED unfortunately I cant use Beautiful Soup for this. Therefore this doesn't really answer my question. – Jdsmith Jul 13 '17 at 12:03

1 Answers1

0

Like everyone says, do not use regex to parse well structured data with abundant of parsers already available. However, as you stated "you have been told to do so", here is a tip.

Test your regex on some of the patterns you are trying to capture outside your script, as in do something like this::

re.compile('<th scope="col" abbr="(.*)">').match('<th scope="col" abbr="hello">').groups()

When you get the pattern absolutely correct, only then run it against that large html file. Notice how I removed the ? from your regex, as you already had a *.

Meitham
  • 9,178
  • 5
  • 34
  • 45