
I've been trying to automate the process of downloading images from certain sites, and people told me to use Python. The website's pages are in the format http://site/... /number.html.

Piecing together things from different sources, I ended up with this:

import os
import re
import hashlib
import urllib
import requests
from BeautifulSoup import BeautifulSoup

def md5(fname):
    # Hash the file in 4 KB chunks so large images need not fit in memory at once.
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

url = raw_input()  # input() in Python 2 would eval the string; raw_input() is safer
# Split the URL on '/' and '.' so its components can be indexed.
usplit = re.split(r'/|\.', url)
title = usplit[5]
volume = None

# Pages that belong to a volume have an extra path component starting with 'v'.
if usplit[6][0] == 'v':
    volume = usplit[6]
    chapter = usplit[7]
    pg_no = int(usplit[8])
else:
    chapter = usplit[6]
    pg_no = int(usplit[7])

# Build the local directory for this chapter and create it if it doesn't exist.
if volume is not None:
    mpath = ".\\" + title + "\\" + volume + "\\" + chapter
else:
    mpath = ".\\" + title + "\\" + chapter
if not os.path.isdir(mpath):
    os.makedirs(mpath)

while True:
    flg = 0
    r = requests.get(url)
    if r.status_code != 200:
        print "Exception: Access!"
        exit()
    print "Getting content from " + url
    page = BeautifulSoup(r.content)
    # The page image is the first <img> tag on the page.
    image = page.findAll('img')[0]['src']
    res = urllib.urlopen(image)
    prevfile = mpath + "\\" + str(pg_no - 1) + ".jpg"
    file = mpath + "\\" + str(pg_no) + ".jpg"
    if not os.path.isfile(file):
        print "Writing to... " + file
        output = open(file, "wb")
        output.write(res.read())
        output.close()
        # If this page is identical to the previous one, we've been
        # redirected back to the last valid page, so stop.
        if flg == 1:
            if md5(file) == md5(prevfile):
                print "All done!"
                exit()
        print "Done."
    else:
        print str(pg_no) + ".jpg already exists, skipping..."
    flg = 1
    pg_no += 1
    # Reassemble the URL for the next page from the split components.
    if volume is not None:
        newurl = usplit[0] + "//" + usplit[2] + "." + usplit[3] + "/" + usplit[4] + "/" + title + "/" + volume + "/" + chapter + "/" + str(pg_no) + "." + usplit[9]
    else:
        newurl = usplit[0] + "//" + usplit[2] + "." + usplit[3] + "/" + usplit[4] + "/" + title + "/" + chapter + "/" + str(pg_no) + "." + usplit[8]
    url = newurl

The problem is that after I reach the last image, the website redirects me to the last valid page. That is, if 46.html is the last page, the request for 47.html is redirected to it, and r.status_code stays 200. To work around this, I tried to compare the last file downloaded with the current file and terminate the program if they match, but this does not seem to work. I am new to this and unsure how to compare the files; the md5 function is something I found here. I tried filecmp too, but it doesn't seem to work either.
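
For concreteness, this is the behaviour I mean, sketched with requests (the real URLs are elided):

r = requests.get("http://site/... /47.html")  # 47.html doesn't exist
print r.status_code  # 200, because the redirect to 46.html is followed
print r.url          # the final URL after redirects, i.e. the .../46.html page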

Any suggestions? Also, is there anything in the code that could be made more Pythonic?

Athena

2 Answers


Assuming that the HTML content of two different pages is never identical, you can detect the redirect by comparing the content of consecutive requests:

import requests

r = requests.get("http://site/... /46.html")
next_page = requests.get("http://site/... /47.html")  # 'next' would shadow the builtin
if r.content == next_page.content:
    print("Site visited already")

If you want to break the while loop, you can use the break statement.
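
Applied to the loop in the question, that could look roughly like this (a sketch; prev_content is a hypothetical variable introduced here to hold the previous page's content):

prev_content = None
while True:
    r = requests.get(url)
    if r.content == prev_content:
        print("All done!")
        break  # leave the loop instead of calling exit()
    prev_content = r.content
    # ... download the image and advance pg_no / url as before ...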

Tristan

You're resetting flg to 0 at the top of every iteration, so the flg == 1 check never fires and the md5 comparison never runs. Initialize it once, outside the loop.
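
A minimal sketch of that change, assuming the rest of the loop body stays as in the question:

flg = 0  # initialize once, before the loop, so it survives between iterations
while True:
    # ... fetch the page and write the file as before ...
    if flg == 1:
        if md5(file) == md5(prevfile):
            print "All done!"
            exit()
    flg = 1
    pg_no += 1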

Athena