
In our system we fetch emails automatically and save their attachments in a database. Now the customer wants to be able to skip certain images, such as banners, that get saved over and over again.

I need a way to create a "blacklist" of images in the database and compare these images to the incoming attachments.

This is how the attachments are saved to the database:

   ....
   InputStream is = new BufferedInputStream(new FileInputStream(attachment));
   pstmt.setBinaryStream(5, is, (int) filesize);
   ....
   pstmt.executeUpdate();

In the database they are saved as the image type and look like 0xFFD8FFE000104A46494600010100000100010000....

What would be an easy way to read a few such images from database and see if any of them are identical to the incoming attachment?

Note that this is a rather complex system that I will not be able to rebuild at this time. So any advice about storing images in folders instead of in database or something similar will not be helpful to me right now.

Johannes

3 Answers


I would recommend using an image hasher like LIRE. With this library you can obtain a hash for each image and then compare the hashes (Euclidean distance). Because this takes similarity between images into account, you can also discard images that are not identical but very similar. Here is a link with the explanation:

https://blog.mayflower.de/1755-Image-similarity-search-with-LIRE.html

And here is the link with the code:

https://github.com/aoldemeier/ImageSimilarityWithLIRE

JavierV

Do not compare the images directly; compare their hash codes. If you use a hash function like SHA-2 (http://de.wikipedia.org/wiki/SHA-2), you can be very confident (*) that there are no collisions and that you will blacklist the right images.

The basic idea is: while reading the image, also compute its hash code using MessageDigest:

MessageDigest digest = MessageDigest.getInstance("SHA-256");

// call digest.update(byte[]) for all the chunks of the file

byte[] hash = digest.digest();
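
One way to fill in the digest.update step is sketched below, assuming the attachment is available as a file or an InputStream. The class name AttachmentHasher and the 8 KiB buffer size are illustrative choices and not part of the original system.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: computes the SHA-256 digest of an attachment.
final class AttachmentHasher {

    /** Reads the stream in 8 KiB chunks and returns its SHA-256 digest (32 bytes). */
    static byte[] sha256(InputStream in) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            // Feed only the bytes that were actually read in this pass.
            digest.update(buffer, 0, read);
        }
        return digest.digest();
    }

    /** Convenience overload for an attachment stored as a file. */
    static byte[] sha256(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sha256(in);
        }
    }
}
```

If attachment in the question's code is a java.io.File (an assumption), the call would look like AttachmentHasher.sha256(attachment.toPath()), made before the row is inserted.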

You can then compare the hash. If you convert it to a Base64 String before storing it to the database, you can use a normal String comparison in your SQL statement or in your Java code:

import org.apache.commons.codec.binary.Base64;

byte[] encodedBytes = Base64.encodeBase64(hash);
System.out.println("encodedBytes " + new String(encodedBytes));

Note: Your blacklist will probably still not work as you intend. If a single pixel of the picture changes even slightly, you will no longer find it in your blacklist. To catch those cases you would have to compare images for similarity, which is a lot harder and more time-consuming.
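
To illustrate this limitation, here is a small sketch that hashes a few bytes, then flips a single bit and hashes again. The class name ExactMatchDemo and the sample bytes are made up for the example; real attachments would be read from the mail or the database.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical demo: a hash comparison only recognises byte-identical data.
final class ExactMatchDemo {

    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True only if both byte arrays produce the same SHA-256 digest. */
    static boolean sameHash(byte[] a, byte[] b) {
        return MessageDigest.isEqual(sha256(a), sha256(b));
    }

    public static void main(String[] args) {
        // Made-up sample data, not a complete image.
        byte[] original = {(byte) 0xFF, (byte) 0xD8, (byte) 0xFF, (byte) 0xE0, 0x00, 0x10};
        byte[] copy = original.clone();
        byte[] altered = original.clone();
        altered[5] ^= 0x01; // flip one bit

        System.out.println(sameHash(original, copy));    // true
        System.out.println(sameHash(original, altered)); // false
    }
}
```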

See also:
How to hash some string with sha256 in Java?
Base64 Encoding in Java
Getting a File's MD5 Checksum in Java

(*) As in, the chances of a false positive are so low, don't even bother to think about it.

David Tanzer
  • Thank you. But I don't really understand the part "// call digest.update(byte[]) for all the chunks of the file". What exactly should I digest? Can I pass the file directly, or an InputStream or something? – Johannes Apr 11 '14 at 07:20
  • @Johannes I have added a link to the "See also" section where you can see how to create a digest of a file. Basically you have to read the whole file in chunks of 1024 (or 2048 or ...) bytes and update the digest object every time. Hope that helps. The code in the link uses MD5, but it will work with SHA-256 too. – David Tanzer Apr 11 '14 at 07:32

Since the image data type is binary and can hold a lot of data, the easiest way to compare image fields is, IMO, hash comparison. So you need to store a hash of the Photo column in your table.

Images are stored in the database in binary form. If you want to develop this blacklist comparison, the best way would be to compare hashes. Basically, you need to store the hashes of all blacklisted images in a column against which you can compare any incoming image's hash. Comparing by name wouldn't be very reliable, as names might change.
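
A minimal sketch of that comparison, assuming the stored hashes are loaded into memory as Base64 strings. The class ImageBlacklist and its methods are hypothetical names, and java.util.Base64 requires Java 8 or later. In a real system the set would be filled from the hash column, or the lookup would be done in SQL.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory blacklist keyed by the SHA-256 hash of the image bytes.
final class ImageBlacklist {

    private final Set<String> hashes = new HashSet<>();

    /** Returns the Base64-encoded SHA-256 digest of the given bytes. */
    static String hashOf(byte[] imageBytes) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(imageBytes);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Blacklists an image given its raw bytes. */
    void add(byte[] imageBytes) {
        hashes.add(hashOf(imageBytes));
    }

    /** Blacklists an image given a hash that was already stored, e.g. in a database column. */
    void addHash(String base64Hash) {
        hashes.add(base64Hash);
    }

    /** True if an image with exactly these bytes has been blacklisted. */
    boolean contains(byte[] attachmentBytes) {
        return hashes.contains(hashOf(attachmentBytes));
    }
}
```

Before saving an attachment, the caller would check contains(attachmentBytes) and skip the insert when it returns true.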

Ajay Gupta