
I would like to use Python to automatically check Troy Hunt's website "https://haveibeenpwned.com/Passwords" to see whether he has published a new password file. To do this, I read the page and want to search it for a string that gives the current version of the file. The files are always named after the schema ....v5.7z, where the number after the v is the version.

# -*- coding: utf-8 -*-

from urllib2 import Request, urlopen, URLError, HTTPError

someurl = 'https://haveibeenpwned.com/Passwords'
# Send a browser-like User-Agent header with the request
req = Request(someurl, headers={'User-Agent': 'Mozilla/5.0'})
try:
    response = urlopen(req)
except HTTPError as e:
    # The server answered, but with an error status code
    print "The server couldn't fulfill the request."
    print 'Error code: ', e.code
except URLError as e:
    # The server could not be reached at all
    print 'We failed to reach a server.'
    print 'Reason: ', e.reason
else:
    print 'everything is fine'
    # Reuse the response opened above instead of requesting the page a second time
    the_page = response.read()
    print the_page


`the_page` now contains the entire HTML code of the page. How can I search it?

I'm not allowed to use BeautifulSoup or any other parser.
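To make the goal concrete, here is a minimal sketch of the kind of string search I have in mind, using only the standard-library `re` module and assuming the restriction is about third-party HTML parsers. The `sample_page` text, the `example.invalid` URLs, and the helper name `latest_version` are made up for illustration; the real markup of the page may look different.

```python
import re

# Hypothetical excerpt standing in for the_page; the real markup may differ.
sample_page = '''
<a href="https://example.invalid/files/some-list-v4.7z">older file</a>
<a href="https://example.invalid/files/some-list-v5.7z">current file</a>
'''

# "v", one or more digits (captured), then the literal suffix ".7z".
VERSION_PATTERN = re.compile(r'v(\d+)\.7z')


def latest_version(html):
    """Return the highest version number found in html, or None if there is none."""
    versions = [int(v) for v in VERSION_PATTERN.findall(html)]
    return max(versions) if versions else None


print(latest_version(sample_page))
```

In the script, the idea would be to call `latest_version(the_page)` in the `else` branch and compare the result with the last version seen. On Python 3, `response.read()` returns bytes, so the page would need to be decoded (for example with `.decode('utf-8')`) before searching.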

  • Possible duplicate of [Parsing HTML using Python](https://stackoverflow.com/questions/11709079/parsing-html-using-python) – Marius Mucenicu Oct 02 '19 at 13:23
  • If you're "not allowed to use beautifulsoap [sic]", or any kind of parser, then you won't be able to do this. You could try regular expressions, but these are a kind of parser too, so you're stuck with writing your own code for analysing the contents of the string `the_page`. But wait, that would be a kind of parser as well, and you aren't allowed to use these! – ForceBru Oct 02 '19 at 13:32
  • Maybe Selenium? – powerPixie Oct 02 '19 at 13:38
