
I have to write a Python script that performs the following tasks:

1- Download the MovieLens dataset from the URL 'http://files.grouplens.org/datasets/movielens/ml-25m.zip'
2- Download the MovieLens checksum from the URL 'http://files.grouplens.org/datasets/movielens/ml-25m.zip.md5'
3- Check whether the checksum of the downloaded archive matches the downloaded checksum
4- If the check succeeds, print the names of the files contained in the downloaded archive

This is what I have written so far:

    from zipfile import ZipFile
    from urllib import request
    import hashlib

    def md5(fname):
        # Hash the file in 4 KiB chunks so the whole archive is never held in memory.
        hash_md5 = hashlib.md5()
        with open(fname, "rb") as f:
            for chunk in iter(lambda: f.read(4096), b""):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()

    url_datasets = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip'
    datasets = 'datasets.zip'
    url_checksum = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip.md5'
    checksum = 'datasets.zip.md5'  # must be defined before it is passed to urlretrieve

    request.urlretrieve(url_datasets, datasets)
    request.urlretrieve(url_checksum, checksum)

    with ZipFile(datasets, 'r') as zipObj:
        listOfiles = zipObj.namelist()
        for elem in listOfiles:
            print(elem)

What I'm missing is a way to compare the checksum I computed with the one I downloaded. Maybe I could create a function "printFiles" that checks the checksum and, if it matches, prints the list of files.

Is there something else I can improve?

user289143

1 Answer

Your code isn't actually making any of the requests.

    from zipfile import ZipFile
    import hashlib
    import requests

    def md5(fname):
        # Read in chunks inside a context manager so the file is closed
        # and the whole archive is never loaded into memory at once.
        hash_md5 = hashlib.md5()
        with open(fname, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                hash_md5.update(chunk)
        return hash_md5.hexdigest()

    url_datasets = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip'
    datasets = 'datasets.zip'
    url_checksum = 'http://files.grouplens.org/datasets/movielens/ml-25m.zip.md5'
    checksum = 'datasets.zip.md5'

    ds = requests.get(url_datasets, allow_redirects=True)
    cs = requests.get(url_checksum, allow_redirects=True)
    ds.raise_for_status()  # fail early on an HTTP error instead of hashing an error page
    cs.raise_for_status()

    with open(datasets, 'wb') as f:
        f.write(ds.content)

    ds_md5 = md5(datasets)
    # The .md5 file holds the digest followed by the file name; keep only the digest.
    cs_md5 = cs.content.decode('utf-8').split()[0]
    print(ds_md5)
    print(cs_md5)

    if ds_md5 == cs_md5:
        print("MATCH")

        with ZipFile(datasets, 'r') as zipObj:
            listOfiles = zipObj.namelist()
            for elem in listOfiles:
                print(elem)
    else:
        print("Checksum fail")

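The question also floats a "printFiles" helper that checks the checksum and lists the files only on a match. A minimal sketch of that idea follows, assuming the archive is already on disk and the expected digest has already been extracted as a string. The names `file_md5` and `print_files` are illustrative, not part of any library.

```python
import hashlib
from zipfile import ZipFile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def print_files(archive_path, expected_md5):
    """Print the archive's member names if its MD5 matches.

    Returns the list of names on a match, or None on a mismatch.
    """
    actual = file_md5(archive_path)
    # Normalise the expected value so stray whitespace or case does not matter.
    if actual != expected_md5.strip().lower():
        print(f"Checksum fail: expected {expected_md5.strip()}, got {actual}")
        return None
    with ZipFile(archive_path) as archive:
        names = archive.namelist()
    for name in names:
        print(name)
    return names
```

With the variables from the code above, the final `if`/`else` would reduce to `print_files(datasets, cs_md5)`.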
Tim Roberts
  • I don't understand what you mean when you say my code isn't actually making any of the requests. I ran my code and I get the same checksum as I get from yours – user289143 Apr 02 '21 at 20:52
  • Your code didn't make any HTTP requests. You never fetched the files. You imported `urllib.request`, but you never called it. If you actually have the file, then it was either left over from before, or there is code you didn't show us. – Tim Roberts Apr 02 '21 at 20:58
  • You're right. I don't know why, but I didn't copy that part in my question. I have updated my code – user289143 Apr 02 '21 at 21:03
  • OK. And did you see how I did the comparison in my code? The md5 value you get from the web site has the file name tacked on. You'll have to remove that to do the comparison. – Tim Roberts Apr 02 '21 at 21:05
  • Is that the split part? Is there a way to know whether the file name is present without opening the md5 file to check? – user289143 Apr 02 '21 at 21:06
  • Not sure what you mean. Have you looked at the contents of `datasets.zip.md5` to see what it looks like? It's only about 50 characters long. – Tim Roberts Apr 02 '21 at 21:16
  • I looked at the file, and it contains the checksum and the file name, so doing cs.content.decode without split will give a checksum fail since the two strings will be different. My question is: is there a way to avoid this kind of error without having to look at the contents of `datasets.zip.md5`? – user289143 Apr 02 '21 at 21:23
  • 1
    You're making this way too complicated. All you have here are two strings to compare. It's an easy problem. Why are you opposed to the `split`? I suppose you could use `if cs_md5.startswith(ds_md5)`, but I don't think you've gained anything. – Tim Roberts Apr 02 '21 at 21:28
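On the question of comparing without first inspecting the checksum file: one option, beyond the `split` and `startswith` approaches discussed in these comments, is to pull the digest out by its shape, since an MD5 digest is always 32 hexadecimal characters. The sketch below is a swapped-in alternative, not what the answer uses, and the helper name `extract_md5` is hypothetical.

```python
import re

# A standalone run of exactly 32 hex digits, i.e. something shaped like an MD5 digest.
# The word boundaries stop it from matching part of a longer hex string.
_MD5_RE = re.compile(r"\b[0-9a-fA-F]{32}\b")


def extract_md5(text):
    """Return the first MD5-shaped hex digest in text, lowercased, or None."""
    match = _MD5_RE.search(text)
    return match.group(0).lower() if match else None
```

This works whether or not a file name accompanies the digest, and returns `None` when no digest is present, so a malformed download can be reported instead of silently failing the comparison. In the answer's code it would be used as `cs_md5 = extract_md5(cs.content.decode('utf-8'))`.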