Trying to get the html from a website

Question

def main:
with open(sourcefile, 'r', encoding='utf-8') as main_file:
    for line in main_file:
        htmlcontent = reader(line)

def reader(line):

    with urllib.request.urlopen(line) as url_file:
      try:
          url_file.read().decode('UTF-8')
      except urllib.error.URLError as url_err:
          print('Error opening url: ', url, url_err)
      except UnicodeDecodeError as decode_err:
          print('Error decoding url: ', url, decode_err)
 return url_file

Hello everyone, I am pretty new to Python and I have a question about reading the HTML from a website. I am trying to simply return the HTML of each page so that I can then run regular expressions on it. The variable line takes its URLs from a text file that contains one URL per line, so the code iterates through that file. This is my code so far, but multiple errors are popping up. I know that I have to use an else clause, and I don't know how to incorporate it. I also want to get the HTML using the urllib.request library.

What exactly do you want to do? There are many useful libraries for parsing websites available. – Nils, Mar 13 '18 at 02:13
@Nils I'm trying to get the HTML code so I can then use regex on it to find certain patterns. But first, I simply have to get the HTML from the website. I was told to use try, except, else in case of errors when going about this. Also, I intend to do this using the urllib.request library. – newbie123123, Mar 13 '18 at 02:23

score 2 · Answer 1 · answered Mar 13 '18 at 02:17

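It's easier to use the requests module, which turns this into a one-liner. Note that the URL must include the scheme (http:// or https://), otherwise requests raises an error.

import requests

html = requests.get("http://www.domain.tld").text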

bigbounty

16,526
5
37
65

Thank you, but I am trying to solve it using urllib.request! – newbie123123 Mar 13 '18 at 02:24
@newbie123123, have a look at this: https://stackoverflow.com/questions/2018026/what-are-the-differences-between-the-urllib-urllib2-and-requests-module – Keyur Potdar Mar 13 '18 at 02:35

score 0 · Answer 2 · answered Mar 13 '18 at 02:13

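This saves the website content in html_content and prints it. In Python 3, urlopen lives in urllib.request (plain urllib.urlopen only exists in Python 2), and the URL needs a scheme.

import urllib.request

url = "http://www.domain.tld"

# urlopen returns a response object; read() returns the body as bytes
seed_url = urllib.request.urlopen(url)
# decode the bytes to text, assuming the page is UTF-8 encoded
html_content = seed_url.read().decode('utf-8')
seed_url.close()

print(html_content)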
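
Since the question asks specifically about try, except, else with urllib.request, here is a minimal sketch of how the pieces could fit together. The function name fetch_html, the choice to return None on failure, and the extra ValueError handler (for lines that are not valid URLs) are my own assumptions, not something from the question, and it assumes the pages are UTF-8 encoded.

```python
import urllib.error
import urllib.request


def fetch_html(url):
    """Return the page at url as text, or None if it cannot be read.

    Hypothetical helper: the name and the None-on-failure convention
    are illustrative assumptions.
    """
    url = url.strip()  # lines read from a text file end with a newline
    try:
        # Both opening the URL and decoding the body can fail,
        # so both happen inside the try block.
        with urllib.request.urlopen(url) as response:
            html = response.read().decode('utf-8')
    except urllib.error.URLError as url_err:
        print('Error opening url:', url, url_err)
    except UnicodeDecodeError as decode_err:
        # Must come before ValueError: UnicodeDecodeError is a subclass of it.
        print('Error decoding url:', url, decode_err)
    except ValueError as value_err:
        # urlopen raises ValueError for strings that are not URLs at all.
        print('Not a valid url:', url, value_err)
    else:
        # Runs only when no exception was raised in the try block.
        return html
    return None
```

The else branch is where the success path goes: it only runs if nothing in the try block raised. In the loop from the question you would call htmlcontent = fetch_html(line) and skip the line when the result is None before applying any regex to it.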

rwx