
I'm trying to write a program somewhat similar to old virus scanners: I walk from the root directory of a system, compute the MD5 checksum of each file, and compare it to a checksum entered by the user. The problem is that while the script walks the file system and hashes each file, certain files stall the program indefinitely with no error message.

I use a try/except to check whether each file is readable before reading it and computing the MD5. Every time the program stalls, it is on the same files, and it gets stuck in the f.read() call. My code is below, and I appreciate any help. I'm using try/except the way the os.access() documentation suggests.

import hashlib
import os

# rootdir is assumed to be defined elsewhere in the script.
def md5Check():
    flist = []
    md5list = []
    badlist = []
    for path, dirname, fname in os.walk(rootdir):
        for name in fname:
            item = os.path.join(path, name)
            # Secondary readability check; the open() below is the primary one.
            filetype = os.access(item, os.R_OK)
            try:
                # Opening the file raises if it cannot be read.
                ft = open(item, 'rb')
                ft.close()
                if filetype:
                    with open(item, 'rb') as f:
                        md5obj = hashlib.md5()
                        fdata = f.read()  # the program stalls here on certain files
                        md5obj.update(fdata)
                        md5list.append(md5obj.hexdigest())
                        print(f'try: {item}')
            except (PermissionError, FileNotFoundError, OSError, IOError):
                badlist.append(item)
                print(f'except:{item}')

Also, keep in mind that comparing against a user-entered checksum is not coded yet, as I can't even get a full list of MD5 checksums to compare against: the program stalls before it finishes walking the filesystem.

I've also tried using os.access() with os.R_OK, but the documentation says that it's an insecure way to check for readability, so I opted for a simple try/except around open() instead.

Lastly, this program should run the same on Windows, Linux, and macOS (and everything so far does, including this persistent issue), so OS-specific solutions won't work for this project.

Any help is appreciated!

Daniel Walker
Ryan B
  • `filetype` is never defined. Also, you should use the `with` context manager to read your files. And don't open and close the file and then open it again to read; just read it once. – Loïc Jun 18 '22 at 19:30
  • @Loïc I edited my code slightly when posting, but filetype is defined when I run it, and it is really just a secondary check of whether the file is readable. I updated the code snippet to reflect that, but the issue I'm running into is the script stalling on f.read() on particular system files like /dev/.null and a handful of files in /sys/kernel – Ryan B Jun 18 '22 at 19:34
  • You probably shouldn't be reading files in `/dev` (do you really want to compute a checksum for your entire hard drive?), `/sys`, or `/proc`. All of these directories contain things that aren't actually files. – larsks Jun 19 '22 at 03:03
  • Is it possible that the files in question are simply very large and the program only appears to hang, or perhaps that you have run out of RAM? – Alexander Jun 19 '22 at 03:10
  • @larsks I sort of figured, since most of the time my program errors out or stalls, it's on one of those files. Other than hard-coding an if statement to avoid these files, is there any way I could skip them, plus similar files on Windows, with a call from some library? I want the program to run cross-platform with no issues, so is there a particular rule or attribute I can check the folders for that will identify them as system files like that? – Ryan B Jun 19 '22 at 03:16
  • @alexpdev The files it was stalling on were things like /dev/.null and other weird system files. As of right now I have a try/except statement plus a subprocess that handles the reading, so that I can time it out after about 20 seconds if the file is not being read and continue walking the system without stalling my program. I don't think they're too big, since I left it for about 25 minutes at one point to see if that was the case and it just stayed stalled. I thought RAM might also be the problem, but for troubleshooting I hard-coded a skip for these files and it went on fine – Ryan B Jun 19 '22 at 03:20

1 Answer


I think the primary cause of your problem is the use of os.access; that is, calling os.access("/dev/null", ...) is what causes your program to hang.

To avoid trying to hash a device file or some other non-regular file system entry, check each item during the traversal to see whether the target is in fact a regular file.

...
for name in fname:
    fullname = os.path.join(path, name)
    # Skip anything that is not a regular file (devices, sockets, FIFOs, ...).
    if os.path.isfile(fullname):
        try:
            with open(fullname, 'rb') as f:
                md5obj = hashlib.md5(f.read())
                md5list.append(md5obj.hexdigest())
                print(f'try: {fullname}')
        except (PermissionError, FileNotFoundError, OSError, IOError):
            badlist.append(fullname)
            print(f'except:{fullname}')
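
One thing to keep in mind: os.path.isfile follows symlinks, so a symlink that points at a regular file still passes the check. If you also want to skip symlinks, a stricter test can be sketched with os.lstat and stat.S_ISREG, both of which are in the standard library on Windows, Linux, and macOS. The helper name is_regular_file is hypothetical (it is not from the question), and this is only a sketch: some entries under /proc or /sys may still report as regular files and block on read, so this check alone may not catch everything.

```python
import os
import stat


def is_regular_file(path):
    """Return True only if path itself is a regular file.

    os.lstat does not follow symlinks, so a symlink is rejected
    even when its target is a regular file.
    """
    try:
        mode = os.lstat(path).st_mode
    except OSError:
        # Entries that vanish or cannot be stat-ed are treated as "not a file".
        return False
    return stat.S_ISREG(mode)
```

In the loop above, `if is_regular_file(fullname):` could stand in for `if os.path.isfile(fullname):`.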

If os.path.isfile doesn't work for you, another option is pathlib, which is also cross-platform and takes an object-oriented approach to the filesystem. I have tested this and confirmed that is_file() returns False for files such as /dev/null.

from pathlib import Path

for name in fname:
    fullname = Path(path) / name
    if fullname.is_file():
        try:
            md5obj = hashlib.md5(fullname.read_bytes())
            md5list.append(md5obj.hexdigest())
            ...
        ...
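
For completeness, here is a self-contained sketch of how the pathlib loop might look as a whole function. The name md5_tree and the returned (hashes, skipped) pair are my own placeholders, not part of the question's code, and the elided parts of the snippet above may well have been intended differently.

```python
import hashlib
import os
from pathlib import Path


def md5_tree(rootdir):
    """Walk rootdir and return (hashes, skipped).

    hashes maps each regular file's path to its MD5 hex digest;
    skipped lists paths that were not regular files or could not be read.
    """
    hashes = {}
    skipped = []
    for path, _dirnames, filenames in os.walk(rootdir):
        for name in filenames:
            fullname = Path(path) / name
            if not fullname.is_file():
                skipped.append(str(fullname))
                continue
            try:
                digest = hashlib.md5(fullname.read_bytes()).hexdigest()
            except OSError:
                # PermissionError and FileNotFoundError are subclasses of OSError.
                skipped.append(str(fullname))
                continue
            hashes[str(fullname)] = digest
    return hashes, skipped
```

Returning a dict keyed by path, rather than a bare list of digests, should make it easier to report which file matched a user-entered checksum later.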
Alexander
  • I tried using os.path.isfile exactly as you described, but for some reason some files return True as if they're readable and then stall the program when they are read. These are files like '/dev/.null' and other random system files. As of right now I changed my code to use the multiprocess library, with a subprocess that has a timeout of 15 seconds, so that if a file fails to read in 15 seconds, that subprocess times out and my program keeps running. I'm still not sure why certain files return True from os.path.isfile and then stall the program, though – Ryan B Jun 19 '22 at 04:17
  • @RyanB /dev/.null always returns false for me.... Interesting. – Alexander Jun 19 '22 at 05:54
  • @RyanB After some further testing I think I discovered the problem... please re-read my full answer. – Alexander Jun 19 '22 at 06:17
  • In case you have files that are larger than your RAM or if you want to restrict the memory footprint of your program, I would avoid reading in the entire file at once, but instead follow https://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python. Maybe a part of the stalling problem could be due to memory swapping. – Carlos Horn Jun 19 '22 at 09:05
  • @CarlosHorn I totally agree, but that is way beyond the scope of the OP's question. Great tip though, thanks. – Alexander Jun 19 '22 at 09:13
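
The chunked-reading suggestion from the comments can be sketched roughly as follows. The function name md5_of_file and the 1 MiB default chunk size are arbitrary choices for illustration, not something from the question or the linked post. Note that this only bounds memory use; it would not by itself stop a read from blocking on a special file.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in fixed-size chunks."""
    md5obj = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5obj.update(chunk)
    return md5obj.hexdigest()
```

Feeding the hash object one chunk at a time gives the same digest as hashing the whole file in a single call, while keeping at most one chunk in memory.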