I've been trying to automate downloading images from certain sites, and I was told to use Python. The site's pages follow the format http://site/.../number.html.
Piecing together things from different sources, I ended up with this:
import os
import re
import hashlib
import urllib
import requests
from BeautifulSoup import BeautifulSoup

# Hash a file in 4 KB chunks so large images don't have to fit in memory.
def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

url = raw_input()  # raw_input, since input() would try to eval the URL on Python 2
usplit = re.split(r'/|\.', url)
title = usplit[5]
volume = None
if usplit[6][0] == 'v':
    volume = usplit[6]
    chapter = usplit[7]
    pg_no = int(usplit[8])
else:
    chapter = usplit[6]
    pg_no = int(usplit[7])

# Build the local directory for this title/volume/chapter.
if volume is not None:
    mpath = ".\\" + title + "\\" + volume + "\\" + chapter
else:
    mpath = ".\\" + title + "\\" + chapter
if not os.path.isdir(mpath):
    os.makedirs(mpath)

while True:
    flg = 0
    r = requests.get(url)
    if r.status_code != 200:
        print "Exception: Access!"
        exit()
    print "Getting content from " + url
    html = r.content
    page = BeautifulSoup(html)
    image = page.findAll('img')[0]['src']
    res = urllib.urlopen(image)
    prevfile = mpath + "\\" + str(pg_no - 1) + ".jpg"
    file = mpath + "\\" + str(pg_no) + ".jpg"
    if not os.path.isfile(file):
        print "Writing to... " + file
        output = open(file, "wb")
        output.write(res.read())
        output.close()
        if flg == 1:
            # Stop once the page just saved is identical to the previous one.
            if md5(file) == md5(prevfile):
                print "All done!"
                exit()
        print "Done."
    else:
        print str(pg_no) + ".jpg already exists, skipping..."
    flg = 1
    pg_no += 1
    if volume is not None:
        newurl = usplit[0] + "//" + usplit[2] + "." + usplit[3] + "/" + usplit[4] + "/" + title + "/" + volume + "/" + chapter + "/" + str(pg_no) + "." + usplit[9]
    else:
        newurl = usplit[0] + "//" + usplit[2] + "." + usplit[3] + "/" + usplit[4] + "/" + title + "/" + chapter + "/" + str(pg_no) + "." + usplit[8]
    url = newurl
The problem is that after I reach the last image, the website redirects me to the last valid page. That is, if 46.html is the last page, the request for 47.html is redirected to it, and r.status_code is still 200.
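While reading about requests I saw that it records redirects on the response object, so I was wondering whether checking for that would be a cleaner way to detect the last page. Something like this (untested sketch):

    r = requests.get(url)
    # If the site silently redirected us, r.url holds the final address and
    # r.history holds the intermediate responses, so either should reveal it.
    if r.history or r.url != url:
        print "Redirected to " + r.url + " - assuming the last page was reached."
        exit()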
To get around this, what I actually tried was comparing the last downloaded file with the current one and terminating the program when they match. However, this does not seem to work. I am new to this and unsure how to compare the files; the md5 function was something I found here. I tried using filecmp too, but it doesn't seem to work either.
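The kind of check I was attempting with filecmp looked roughly like this (first_pg is just a placeholder name for the page number I started from):

    import filecmp
    # shallow=False forces a byte-by-byte comparison instead of just os.stat()
    if pg_no > first_pg and filecmp.cmp(prevfile, file, shallow=False):
        print "Current page is identical to the previous one, stopping."
        exit()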
Any suggestions? Also, regarding the code, is there anything that could be made more Pythonic?
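For example, I wasn't sure whether gluing backslashes together by hand is frowned upon; would something like this be preferred?

    # os.path.join picks the right separator and skips the volume when absent
    parts = [".", title] + ([volume] if volume is not None else []) + [chapter]
    mpath = os.path.join(*parts)
    if not os.path.isdir(mpath):
        os.makedirs(mpath)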