
This follows on from my question yesterday: Finding duplicate files via hashlib?

I now realize I need to group the files by file size. So let's say I had 10 files in a folder and 3 of them were 50 bytes each, I would group those 3 files together.

I've found that I can find the size in bytes of a file by using:

print os.stat('/Users/simon/Desktop/file1.txt').st_size

or:

print os.path.getsize('/Users/simon/Desktop/file1.txt')

Which is great. But how would I scan a folder using os.walk and group same-sized files together using one of the methods above?

After that, I want to hash them via hashlib's MD5 to find duplicates.

BubbleMonster
  • Don't mean to brag, but my answer to your previous question does that. – tdelaney Sep 11 '13 at 20:08
  • Yep, @tdelaney. Using a `defaultdict(list)` lookup table, like you did, is probably the best and simplest way. – wflynny Sep 11 '13 at 20:11
  • I gave it a quick go and couldn't get it working yesterday, but I've just got it working now. Thanks. Can you explain a bit more about the following: 1) why a (1024*1024) read size and not 5000000? 2) What does size_map = defaultdict(list) do exactly? 3) I'm guessing sys.argv[1] just makes the python py.py filepath invocation work (where filepath is argv[1])? Sorry for all the questions! Thanks tdelaney – BubbleMonster Sep 11 '13 at 20:14
  • @BubbleMonster - I'll answer these in the original post to reduce confusion. – tdelaney Sep 11 '13 at 20:17
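
The defaultdict(list) lookup table discussed in these comments works roughly like this (a minimal sketch, not tdelaney's actual code; it takes the folder to scan as sys.argv[1]):

import os
import sys
from collections import defaultdict

#missing keys start out as empty lists, so no has_key/setdefault dance is needed
size_map = defaultdict(list)

#usage: python py.py filepath -- the folder to scan comes in as sys.argv[1]
for dirname, subdirs, filenames in os.walk(sys.argv[1]):
    for name in filenames:
        path = os.path.join(dirname, name)
        size_map[os.path.getsize(path)].append(path)

#only size groups with more than one file can contain duplicates
for size, paths in size_map.items():
    if len(paths) > 1:
        print size, paths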

2 Answers


Sort the filenames by size, then use itertools.groupby to group same-sized files together.

import os
import os.path
import itertools

#create a dummy file containing the given number of bytes
def create_file(name, size):
    if os.path.isfile(name): return
    with open(name, "w") as f:
        f.write("X" * size)

#create some sample files 
create_file("foo.txt", 4)
create_file("bar.txt", 4)
create_file("baz.txt", 4)
create_file("qux.txt", 8)
create_file("lorem.txt", 8)
create_file("ipsum.txt", 16)

#get the filenames in this directory
filenames = [filename for filename in os.listdir(".") if os.path.isfile(filename)]

#sort by size
filenames.sort(key=lambda name: os.stat(name).st_size)

#group by size and iterate
for size, items_iterator in itertools.groupby(filenames, key=lambda name: os.stat(name).st_size):
    items = list(items_iterator)
    print "{} item(s) of size {}:".format(len(items), size)
    #insert hashlib code here, or whatever else you want to do
    for item in items:
        print item

Result:

3 item(s) of size 4:
bar.txt
baz.txt
foo.txt
2 item(s) of size 8:
lorem.txt
qux.txt
1 item(s) of size 16:
ipsum.txt
1 item(s) of size 968:
test.py
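
One possible way to fill in the "#insert hashlib code here" stub, continuing from the snippet above (a sketch assuming Python 2, as in the rest of this page; files are read in 1024*1024-byte chunks so large files never have to fit in memory):

import hashlib

def md5_digest(path):
    #feed the file to MD5 in 1 MB chunks rather than reading it in whole
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), ""):
            md5.update(chunk)
    return md5.hexdigest()

#reuse the sorted filenames list from above; only groups of 2+ files can hold duplicates
for size, items_iterator in itertools.groupby(filenames, key=lambda name: os.stat(name).st_size):
    items = list(items_iterator)
    if len(items) < 2:
        continue
    by_hash = {}
    for item in items:
        by_hash.setdefault(md5_digest(item), []).append(item)
    for digest, paths in by_hash.items():
        if len(paths) > 1:
            print "duplicates of {}:".format(digest), paths

Comparing sizes first means MD5 is only computed for files that could actually be duplicates.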
Kevin

This sample code builds a dictionary with sizes as keys and lists of same-sized files as values.

#!/usr/bin/env python

import os

d = {}

#walk the current directory tree and group full paths by file size
for dirname, subdirs, filelist in os.walk(os.getcwd()):
    for f in filelist:
        fullname = os.path.join(dirname, f)
        sz = os.path.getsize(fullname)
        d.setdefault(sz, []).append(fullname)

print d
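
From here, only the size groups holding more than one file are worth hashing; a quick way to filter them (a small sketch following on from the dictionary built above):

#keep only the sizes that map to two or more files -- the duplicate candidates
candidates = dict((sz, names) for sz, names in d.items() if len(names) > 1)
print candidates

Each remaining group can then be run through the MD5 step shown under the first answer.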
Ankur Agarwal