Comparing files in directory to each other with no repeated comparisons

Question

I want to build a list of file pairs to compare in a directory of N files. The end goal is to compare images to find duplicates regardless of format. For example, take a directory containing the files 1.jpg, 2.jpg, and 3.jpg.

Using this

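import os
import sys


def main(argv):
    # List the directory twice so every file can be paired with every file.
    list1 = os.listdir(argv[0])
    list2 = os.listdir(argv[0])

    file_compare_list = []

    # The nested loops build all N * N ordered pairs, including self-pairs.
    for pic1 in list1:
        for pic2 in list2:
            file_compare_list.append([pic1, pic2])

    print(file_compare_list)


if __name__ == "__main__":
    main(sys.argv[1:])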

I get a list like this

[['1.jpg', '1.jpg'], #0
['1.jpg', '2.jpg'],  #1
['1.jpg', '3.jpg'],  #2
['2.jpg', '1.jpg'],  #3
['2.jpg', '2.jpg'],  #4
['2.jpg', '3.jpg'],  #5
['3.jpg', '1.jpg'],  #6
['3.jpg', '2.jpg'],  #7
['3.jpg', '3.jpg']]  #8

Now I could go through this list and be assured that every file gets compared, but there are obvious redundancies. Indexes 0, 4, and 8 are easy to handle: I can compare the file names and drop them. What concerns me more are pairs like indexes 2 and 6, which contain the same two files in opposite order, so processing both would repeat a comparison. Any help with this would be greatly appreciated.

score 6 · Accepted Answer · answered Jul 16 '11 at 23:04

You need itertools.combinations. This code prints exactly what you need:

import os, itertools

files = os.listdir("/path/to/files")
for file1, file2 in itertools.combinations(files, 2):
    print(file1, file2)

And some theory behind it: http://en.wikipedia.org/wiki/Combination
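
A minimal sketch of how this could be applied to a whole directory, assuming a byte-for-byte check is an acceptable stand-in for a real image comparison. The names find_matching_pairs and byte_identical are hypothetical and not part of the answer above. Note that filecmp.cmp will not recognise the same picture saved in two different formats, so an actual image comparison would have to be passed in as the `same` argument.

```python
import filecmp
import itertools
import os


def byte_identical(path_a, path_b):
    # Stand-in comparison (assumption): True only if both files hold the same bytes.
    # shallow=False forces a content comparison instead of trusting os.stat data.
    return filecmp.cmp(path_a, path_b, shallow=False)


def find_matching_pairs(directory, same=byte_identical):
    # sorted() makes the pair order deterministic; os.listdir order is arbitrary.
    paths = [os.path.join(directory, name) for name in sorted(os.listdir(directory))]
    # Skip subdirectories and other non-regular entries.
    paths = [p for p in paths if os.path.isfile(p)]
    # Each unordered pair is visited exactly once.
    return [(a, b) for a, b in itertools.combinations(paths, 2) if same(a, b)]
```

Building full paths with os.path.join matters here because os.listdir returns bare names, which only resolve if the script happens to run inside that directory.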

tomasz


I basically reinvented this in my answer, +1 for knowing the right library to use. – Matt Ball Jul 16 '11 at 23:10
Awesome. Didn't know itertools had this particular one, though I've used that library before. Marking this question for future reference... – Jonathanb Jul 17 '11 at 04:13
@Jonathanb you're probably aware of it by now, but itertools also provides the Cartesian product and permutations. All of them are pretty useful and very self-descriptive when used in code. – tomasz Jul 17 '11 at 11:09
How can we change this to compute, say, similar(file1, other), where other can be any other file in the directory, and do that for every file? – Mona Jalal Oct 10 '16 at 21:35

score 4 · Answer 2 · answered Jul 16 '11 at 23:04

There is always itertools.combinations:

import itertools

my_list = ['1.jpg', '2.jpg', '3.jpg']
my_combinations = list(itertools.combinations(my_list, 2))

my_combinations will be:

[('1.jpg', '2.jpg'), ('1.jpg', '3.jpg'), ('2.jpg', '3.jpg')]
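
As a quick sanity check on why this helps, the sketch below compares the number of pairs against the nested-loop approach from the question. It uses itertools.product to stand in for the two nested loops; the variable names are only illustrative.

```python
import itertools

my_list = ['1.jpg', '2.jpg', '3.jpg']
n = len(my_list)

# Unordered pairs without self-pairs: what combinations() yields.
unordered_pairs = list(itertools.combinations(my_list, 2))

# Every ordered pair, self-pairs included: what two nested loops yield.
all_ordered_pairs = list(itertools.product(my_list, repeat=2))

# n * (n - 1) / 2 comparisons instead of n * n.
assert len(unordered_pairs) == n * (n - 1) // 2
assert len(all_ordered_pairs) == n * n
```

For the three files in the question that is 3 comparisons instead of 9, and the saving approaches half of the work as the directory grows.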

Nate

12,499
5
45
60

Ah yes, it seems as though I've rewritten `itertools.combinations`. +1 – Matt Ball Jul 16 '11 at 23:07
yeah, but that's nothing to be ashamed of. Using itertools for this is a little bit like using a gun to swat a fly. ;^) – Nate Jul 16 '11 at 23:10
I believe pretty darn strongly in "don't reinvent the wheel." Especially with Python, the batteries really are included, so you might as well use 'em rather than wiring up a potato. – Matt Ball Jul 16 '11 at 23:18
An additional suggestion is to create a dict of images where the key is the image width and height and the value is a list (or set) of the names of those images of that size. That can reduce the number of full comparisons you make. – MRAB Jul 16 '11 at 23:19
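
MRAB's bucketing idea could look roughly like the sketch below. The function candidate_pairs and its `key` parameter are hypothetical; how the key (for example an image's width and height) is obtained is left to the caller, since that depends on the image library in use.

```python
import itertools
from collections import defaultdict


def candidate_pairs(names, key):
    # Group names by a cheap key; only names sharing a key can be duplicates.
    buckets = defaultdict(list)
    for name in names:
        buckets[key(name)].append(name)
    # Yield unordered pairs within each bucket only.
    for group in buckets.values():
        for pair in itertools.combinations(group, 2):
            yield pair
```

For instance, with a made-up lookup table of image dimensions, `candidate_pairs(sizes, sizes.get)` would only pair files whose dimensions match, so the expensive full comparison runs on far fewer pairs.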

score 3 · Answer 3 · edited May 23 '17 at 10:34

How's this for a hint?

Instead of computing all off-diagonal elements of the comparison matrix P x P:

P = {A, B, C, D, ...}

  + A + B + C + D + ...
A |   | * | * | * | ...
B | * |   | * | * | ...
C | * | * |   | * | ...
D | * | * | * |   | ...
  |   |   |   |   |

you can compute either the upper triangle:

  + A + B + C + D + ...
A |   | * | * | * | ...
B |   |   | * | * | ...
C |   |   |   | * | ...
D |   |   |   |   | ...
  |   |   |   |   |

or the lower triangle:

  + A + B + C + D + ...
A |   |   |   |   | ...
B | * |   |   |   | ...
C | * | * |   |   | ...
D | * | * | * |   | ...
  |   |   |   |   |

(from this answer of mine)

Apologies if that was too obtuse. Some actual code:

>>> items = ['a', 'b', 'c', 'd', 'e']  # avoid shadowing the built-in list
>>> pairs = [[x, y] for i, x in enumerate(items) for y in items[i+1:]]
>>> print(pairs)
[['a', 'b'], ['a', 'c'], ['a', 'd'], ['a', 'e'], ['b', 'c'], ['b', 'd'], ['b', 'e'], ['c', 'd'], ['c', 'e'], ['d', 'e']]
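
To tie the two triangles back to code, here is a small sketch (variable names are illustrative only) showing that the upper triangle produces the same pairs as itertools.combinations, and that the lower triangle holds the same pairs with each one reversed.

```python
import itertools

items = ['a', 'b', 'c', 'd', 'e']

# Upper triangle: pair each element with the elements after it.
upper = [(x, y) for i, x in enumerate(items) for y in items[i + 1:]]

# Lower triangle: pair each element with the elements before it.
lower = [(x, y) for i, x in enumerate(items) for y in items[:i]]

# The upper triangle matches combinations() exactly, order included.
assert upper == list(itertools.combinations(items, 2))

# The lower triangle contains the same unordered pairs.
assert sorted(tuple(sorted(p)) for p in lower) == sorted(upper)
```

Either triangle covers every comparison once, so picking one is purely a matter of taste.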

How can we change this to compute, say, similar(file1, other), where other can be any other file in the directory, and do that for every file? – Mona Jalal, Oct 10 '16 at 21:47
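
One possible way to get a per-file result, sketched under the assumption that similar(a, b) is symmetric (it gives the same answer as similar(b, a)). The function matches_for_each is hypothetical, and `similar` is whatever comparison the caller supplies.

```python
import itertools
from collections import defaultdict


def matches_for_each(names, similar):
    # Call similar() once per unordered pair, then record a hit for both files.
    matches = defaultdict(list)
    for a, b in itertools.combinations(names, 2):
        if similar(a, b):
            matches[a].append(b)
            matches[b].append(a)
    # Files with no match do not appear in the result.
    return dict(matches)
```

This keeps the number of similar() calls at n * (n - 1) / 2 while still answering "which other files resemble this one?" for every file. If similar() is not symmetric, itertools.permutations(names, 2) would be needed instead, at twice the cost.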

score 2 · Answer 4 · answered Jul 16 '11 at 23:04

Check out what this does and adapt it to your problem (here a is your list of file names):

[(x, y) for x in a for y in a if x < y]
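
A small sketch of what that comprehension produces, assuming the names in a are unique (which holds for a single directory listing). The comparison against itertools.combinations is added here for illustration only.

```python
import itertools

a = ['2.jpg', '3.jpg', '1.jpg']  # unique names, deliberately unsorted

# Keep only the pairs whose first element sorts before the second.
filtered = [(x, y) for x in a for y in a if x < y]

# combinations() pairs by position, so sort within each pair before comparing.
by_position = [tuple(sorted(pair)) for pair in itertools.combinations(a, 2)]

assert sorted(filtered) == sorted(by_position)
```

The result is the same set of pairs, but the comprehension still iterates over all n * n candidates before filtering, and it relies on the elements being orderable and distinct.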

Ray Toal

