I am trying to write a Python script that behaves like Ctrl + S in the Chrome web browser: it saves the HTML page, downloads any files the page links to and, finally, replaces the URIs of those links with their local paths on disk.
The code posted below attempts to replace the URIs for CSS files with local paths on my computer.
I have come across an issue when attempting to parse different sites, and it's becoming a bit of a headache.
The original error I get is UnicodeDecodeError: 'ascii' codec can't decode byte 0xa3 in position 13801: ordinal not in range(128)
import glob
import shutil
import urllib2

url = 'http://www.s1jobs.com/job/it-telecommunications/support/edinburgh/620050561.html'
response = urllib2.urlopen(url)
webContent = response.read()  # raw bytes (a Python 2 str), not unicode
# title, cssUri and cssFilename are set elsewhere in the script (not shown)
dest_dir = 'C:/Users/Stuart/Desktop/' + title
for f in glob.glob(r'./*.css'):
    newContent = webContent.replace(cssUri, "./" + title + '/' + cssFilename)
    shutil.move(f, dest_dir)
This issue persists whether I print newContent or write it to a file. I tried to follow the top answer to the Stack Overflow question "UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 13: ordinal not in range(128)" and changed my line
newContent = webContent.replace(cssUri, "./" + title + '/' + cssFilename)
to
newContent = webContent.decode('utf-8').replace(cssUri, "./" + title + '/' + cssFilename)
I have also tried .decode('utf-16') and .decode('utf-32'). The three attempts fail with these errors, respectively:
13801: invalid start byte
byte 0x0a in position 44442: truncated data
can't decode bytes in position 0-3: code point not in range(0x110000)
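To make the failures easier to reproduce without fetching the page, here is a minimal sketch that decodes a short byte string containing 0xa3 with each codec. The `sample` bytes and the `try_decode` helper are made up for illustration; the sketch assumes the page holds a pound sign stored as the single byte 0xa3 (as it would be in Latin-1 or Windows-1252), which I have not confirmed.

```python
# Hypothetical sample: a pound sign stored as the single byte 0xa3,
# followed by a newline so the total length is odd (17 bytes).
sample = b'Salary: \xa3 30,000\n'


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return 'error: %s' % exc


# ascii, utf-8, utf-16 and utf-32 each fail for a different reason;
# latin-1 maps every byte to a character, so it always succeeds.
for enc in ('ascii', 'utf-8', 'utf-16', 'utf-32', 'latin-1'):
    print('%-8s %r' % (enc, try_decode(sample, enc)))
```

On this sample the messages have the same form as the ones quoted above (ordinal not in range, invalid start byte, truncated data, code point not in range), which suggests the page is simply not encoded in any of the UTF variants.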
Does anyone have any idea how I should remedy this issue? I should add that when I print the variable webContent, there is output (though I noticed Chinese writing at the bottom).
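One direction that might be worth trying, sketched below, is to decode with the charset the server declares in its Content-Type header instead of guessing, and to fall back to other codecs only if that fails. The `decode_page` helper, its parameters and the fallback order are assumptions for illustration, not part of my script, and I have not checked which charset this site actually declares.

```python
import re


def decode_page(raw, content_type=None, fallbacks=('utf-8', 'cp1252')):
    """Decode raw page bytes to text (hypothetical helper).

    Tries the charset named in the Content-Type header first, then each
    fallback codec, and finally latin-1, which accepts any byte.
    """
    candidates = []
    if content_type:
        # e.g. 'text/html; charset=ISO-8859-1'
        match = re.search(r'charset=["\']?([\w.:-]+)', content_type, re.I)
        if match:
            candidates.append(match.group(1))
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong codec for these bytes, or an unknown codec name
            continue
    # Last resort: never raises, but may mis-render some characters
    return raw.decode('latin-1')


text = decode_page(b'<p>\xa3 30,000</p>', 'text/html; charset=ISO-8859-1')
```

With urllib2 the header should be available as response.info().getheader('Content-Type'). The decoded result is a unicode object, so it would presumably need encoding again (for example with .encode('utf-8')) before being written to a file.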