I have many images (about 10,000). My goal is to run a binary search over the set of two-dimensional matrices to find out whether there are duplicate images, and then delete the duplicates. But is there a notion of one matrix being greater than another? How can I solve this? The alternative is a sequential search, but that is very inefficient.
Hash each matrix, then you don't even need binary search to find duplicates – Miki May 23 '19 at 10:52
1 Answer
@Miki's suggestion seemed like a fun exercise, so I created an implementation that you can use.
More on hashing here
import hashlib, os, cv2
# location of images
path = '.'
# create a set that will hold the hashes (membership tests on a set are fast)
all_hashes = set()
# get and iterate all image paths
all_files = os.listdir(path)
for f in all_files:
    # check image extension
    name, ext = os.path.splitext(f)
    if ext == '.jpg':
        # open image (join with path so this also works when path is not '.')
        img = cv2.imread(os.path.join(path, f))
        # cv2.imread returns None when the file cannot be decoded; skip those
        if img is None:
            continue
        # hash the pixel data and get hex representation
        img_hash = hashlib.md5(img.tobytes()).hexdigest()
        # check if hash already exists, if not then add it to the set
        if img_hash in all_hashes:
            print('Already exists: ' + f)
        else:
            all_hashes.add(img_hash)
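Since the question also asks about deleting the duplicates, here is a minimal sketch of how that step might look. It is a variation, not the same method: it hashes the raw file bytes with the standard library instead of the decoded pixels, so two files with identical pixels but different metadata or compression would not be matched. The function name `find_duplicate_files` and its `extensions` parameter are hypothetical, chosen only for this illustration.

```python
import hashlib
import os

def find_duplicate_files(folder, extensions=(".jpg",)):
    """Return paths whose byte content repeats an earlier file in `folder`."""
    seen = {}        # digest -> first path seen with that content
    duplicates = []  # later files whose content was already seen
    # sorted() makes "first occurrence" deterministic across runs
    for name in sorted(os.listdir(folder)):
        if os.path.splitext(name)[1].lower() not in extensions:
            continue
        full_path = os.path.join(folder, name)
        with open(full_path, "rb") as fh:
            digest = hashlib.md5(fh.read()).hexdigest()
        if digest in seen:
            duplicates.append(full_path)
        else:
            seen[digest] = full_path
    return duplicates
```

The function only reports duplicates. Once the list has been checked by eye, each returned path could be passed to `os.remove`.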

J.D.
Thanks a lot for the answer. But since I am using this function as a black box, are we sure that it will output a unique result for each image? Another doubt: if two images are the same but have different sizes, will the result of the md5 function be the same? And what if two images are very similar to each other, with only some nuances of color changed? – Francesco Ladogana May 27 '19 at 22:20
Google can give far better explanations of hash function than me ;) In short: the hash-function takes all the pixel values and calculates a fixed length output. It is designed so that a small difference in the input will create a large difference in the output. 1 pixel difference between images will generate very different hashes. Different sizes of images have different amounts of pixels and will generate different hashes. Only images with the exact same array of pixel values will generate the same hash, regardless of filename. You can do some tests to assure yourself. – J.D. May 28 '19 at 11:32
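The behavior described in this comment can be checked without any image files. The sketch below is only an illustration under stated assumptions: short byte strings stand in for the pixel arrays of hypothetical 4x4 and 3x3 grayscale images, and `md5_hex` is a helper name made up for the example.

```python
import hashlib

def md5_hex(pixel_bytes):
    """Return the MD5 hex digest of a raw pixel buffer."""
    return hashlib.md5(pixel_bytes).hexdigest()

# Hypothetical 4x4 grayscale image: 16 pixel values, one byte each.
original = bytes(range(16))
exact_copy = bytes(range(16))

# The same image with a single pixel changed by one intensity level.
tweaked = bytearray(original)
tweaked[5] += 1
tweaked = bytes(tweaked)

# Stand-in for a differently sized image: 3x3, so only 9 pixel values.
smaller = bytes(range(9))

# Identical pixel data gives identical digests.
print(md5_hex(original) == md5_hex(exact_copy))
# A one-pixel change gives a different digest.
print(md5_hex(original) == md5_hex(tweaked))
# A different number of pixels gives a different digest.
print(md5_hex(original) == md5_hex(smaller))
```

Printing the digests of `original` and `tweaked` side by side also shows that the two strings look unrelated, even though the inputs differ by a single value.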
Theoretically, 2 completely different images can generate the same hash, but the odds of that happening are practically zero. You would have to hash many billions of files, as you can read in this funny and interesting [answer](https://stackoverflow.com/a/288519). – J.D. May 28 '19 at 11:35
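To put a rough number on those odds for the 10,000 images in the question, here is a back-of-the-envelope sketch. It assumes MD5's 128-bit outputs behave as if uniformly random for ordinary, non-adversarial files, and it uses the standard birthday-style upper bound (number of pairs divided by number of possible hashes). The function name `collision_probability` is made up for the example.

```python
def collision_probability(n_items, hash_bits=128):
    """Upper bound on the chance that any two of n_items share a hash.

    Assumes uniformly distributed hash values; accurate when the result is tiny.
    """
    pairs = n_items * (n_items - 1) // 2  # number of distinct pairs of items
    return pairs / 2 ** hash_bits         # each pair collides with chance 2**-bits

p = collision_probability(10_000)
print(f"{p:.2e}")  # on the order of 1e-31
```

This covers accidental collisions only. MD5 is known to be vulnerable to deliberately constructed collisions, which does not matter for deduplicating your own photos but would matter if the files came from an untrusted source.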