
I have written a Python script that uses the checksumdir (https://github.com/cakepietoast/checksumdir) library to calculate an MD5 hash based on a directory's contents. Calculating this hash for a 350 MB directory located on a mechanical hard drive takes a few seconds.

Calculating the hash for a 30 GB directory, however, takes ages; I never let it finish, but 12+ hours was too long for me anyway. I have no idea what causes this. One thing I could think of is that a 350 MB directory fits in my RAM while 30 GB does not, but the block size in checksumdir seems to be 64 * 1024 (65536) bytes, and from what I've found with Google that seems to be a reasonable number.
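For reference, I assume checksumdir reads each file in blocks roughly like this minimal sketch (not its actual code, just the general chunked-hashing pattern with hashlib and a 64 * 1024-byte block), so memory use should stay constant regardless of file size:

import hashlib

def hash_file(path, block_size=64 * 1024):
    # Read the file in fixed-size blocks so only one block is in memory at a time
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            md5.update(block)
    return md5.hexdigest()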

I also found that the 350 MB directory contains 466 files, whereas the 30 GB directory contains 22696 files. Even if I extrapolate from that (a few seconds times roughly 22696 / 466 ≈ 49 would still only be a few minutes), I can't explain the excessive time needed.

FWIW: I want to use the script to find directories with duplicate contents. I haven't found any existing application that does that, so I want to calculate the hashes myself and display the end result in an HTML file.
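The duplicate detection itself would then just be grouping directory paths by hash once the results dict below has been filled; roughly something like this sketch:

from collections import defaultdict

# results maps path -> [dir_name, size, hash]; group the paths by their hash
groups = defaultdict(list)
for path, (dir_name, size, dir_hash) in results.items():
    groups[dir_hash].append(path)

# Only hashes shared by more than one directory are duplicates
duplicates = {h: paths for h, paths in groups.items() if len(paths) > 1}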

Relevant code:

#!/usr/bin/env python3

import os
import re
from checksumdir import dirhash # https://pypi.python.org/pypi/checksumdir/1.0.5
import json
import datetime


now = datetime.datetime.now().strftime("%Y-%m-%d_%H:%M")
results = {}
sorted_results = {}
single_entries = []
compare_files = False
compare_directories = True
space_to_save = 0
html_overview = []
html_overview.extend([
    '<!DOCTYPE html>',
    '<html>',
    '<head>',
    '<link rel="stylesheet" type="text/css" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">',
    '</head>',
    '<body>',
    '    <table style="width:90%" class="table table-hover table-striped">',
    '        <tr>',
    '            <td colspan=4></td>',
    '        </tr>'
])

# Configuration
root = "/home/jeffrey/Documenten" # Root directory to start search
create_hash = True
calculate_file_folder_size = True
compare_file_folder_names = False
sort_by = "hash" # Options: hash, size, name
json_result_file = 'result_' + now + '.json'
html_result_file = "DuplicatesHtml-" + now + ".html"
only_show_duplicate_dirs = True
remove_containing_directories = True
verbose_execution = True

# Calculate size of directory recursively - http://stackoverflow.com/questions/1392413/calculating-a-directory-size-using-python
def get_size(start_path = '.'):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp) / 1048576 # size from bytes to megabytes
    return total_size


# Calculate comparison properties, sort and save based on os.walk for recursive search
for dirName, subdirList, fileList in os.walk(root):
    for dir in subdirList:       
        dir_name = dir
        dir = os.path.join(dirName, dir)
        if dir_name[0] != ".": # check the directory name itself, not the joined path
            if verbose_execution == True:
                print(dir)
            if calculate_file_folder_size == True:
                size = get_size(dir)
                if verbose_execution == True:
                    print(size)
            if create_hash == True:
                hash = dirhash(dir, 'md5')
                if verbose_execution == True:
                    print(hash)
            results[dir] = [dir_name, size, hash]

1 Answer


OK, so I found that one file was more or less just hanging the process. I found that out by using another Python function that calculates hashes with verbose output; when I deleted that file (I didn't need it, it was something in the AppData directory on Windows) everything worked fine. For future reference: around 900 GB of data took half a day to process on a second-generation i5 over a SATA connection, so I suspect I/O is the bottleneck here. That is a duration I would expect.
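The verbose hashing I mentioned followed essentially this pattern (a rough sketch, not my exact code): print each file path before hashing it, so the file that hangs is simply the last path printed.

import hashlib
import os

def dirhash_verbose(directory, block_size=64 * 1024):
    # Hash every file under directory, printing each path first so a hanging file is easy to spot
    md5 = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            print(path)  # the last printed path is the one that hangs
            with open(path, 'rb') as f:
                while True:
                    block = f.read(block_size)
                    if not block:
                        break
                    md5.update(block)
    return md5.hexdigest()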
