3

Hi and thanks for looking!

Update

For the sake of clarity, a third-party .NET library is just fine, preferably an open-source or free one. The solution need not be native .NET.

Background

I am working on an enterprise web application for which the client has given us thousands of pages of content in MS Word documents. We have to parse these documents, extract the data, and send it to the content database.

Embedded in these docs are various images, each representing a larger original image that lives in a separate folder.

The client did not provide any paths to the original source images, so when we see content with an embedded image in an MS Word doc, we have to go through several "assets" folders and look for the corresponding image, which is extraordinarily time-consuming.

We are already using DocX to parse the documents, so you can assume that we have a list of bitmap images to loop through that we have pulled from the document.

Question

Given a list of bitmaps that we just extracted from the document, how do we search a different folder containing hundreds of images for the matching image and then return its file path?

TinEye.com does this over the web. I am wondering if, using System.Drawing or something, we can do it on a PC with C#.

Thanks!

Matt

Matt Cashatt

3 Answers

2

Hate to propose an answer to my own question, but I think I might be on to something here. Here is heuristic/pseudo code for a C# forms app--your thoughts are appreciated:

Part 1

  1. Using System.IO, traverse the "assets" folders and get all images.
  2. For each image, Base64 encode it.
  3. Take the resulting string and place in an XML file:
<Image>
     <Path>C:\SomePath</Path>
     <EncodedString>[Some Base64 String]</EncodedString>
</Image>

Now we have an XML file containing all original images, in Base64 form, along with their file path.
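Part 1 could be sketched roughly as follows. This is only an illustration under some assumptions: the class name `ImageIndexBuilder`, the list of file extensions, and the `<Images>` root element (added because an XML document needs a single root) are all made up for the example and are not part of the original plan.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public static class ImageIndexBuilder
{
    // Extensions treated as images. This list is an assumption; adjust as needed.
    private static readonly string[] ImageExtensions =
        { ".jpg", ".jpeg", ".png", ".gif", ".bmp" };

    // Walks assetsRoot recursively and emits one <Image> element per image file,
    // holding the file path and the Base64 encoding of the raw file bytes.
    public static XDocument Build(string assetsRoot)
    {
        var images = Directory
            .EnumerateFiles(assetsRoot, "*", SearchOption.AllDirectories)
            .Where(p => ImageExtensions.Contains(
                Path.GetExtension(p), StringComparer.OrdinalIgnoreCase))
            .Select(p => new XElement("Image",
                new XElement("Path", p),
                new XElement("EncodedString",
                    Convert.ToBase64String(File.ReadAllBytes(p)))));

        // Hypothetical <Images> root wrapping all entries.
        return new XDocument(new XElement("Images", images));
    }
}
```

Calling `ImageIndexBuilder.Build(@"C:\assets").Save("index.xml")` would write the file to disk (both paths are placeholders). Note that this encodes the file bytes as stored, so an exact match in Part 2 only happens if the embedded image is byte-for-byte the same file.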

Part 2

  1. Using DocX, extract all images from MS Word Doc.
  2. For each image, use Linq-to-Xml to search for an exact match in the XML file from Part 1.
  3. If there are no exact matches, start iterating the XML file and computing the Levenshtein distance between the extracted image's Base64 string and each stored one.
  4. Inside the foreach, store the XML node Id (or file path) and Levenshtein distance as a key/value pair in an object.
  5. Take the k/v pair with the lowest LD score and return the file path.
  6. For performance, set tolerance so that the foreach stops if a certain original image has an acceptably low LD score when compared to the image extracted from the document.
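The matching steps above might look something like the sketch below. It assumes the XML from Part 1 has already been loaded into a dictionary of file path to Base64 string; the names `ImageMatcher`, `FindBestMatch`, and `tolerance` are hypothetical. The distance function is a plain two-row Levenshtein, not any particular library's implementation, and its running time grows with the product of the two string lengths, so it will be slow on large images.

```csharp
using System;
using System.Collections.Generic;

public static class ImageMatcher
{
    // Levenshtein distance using two rolling rows, so memory is O(b.Length)
    // instead of a full int[a.Length, b.Length] matrix.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            // The row just computed becomes "previous" for the next pass.
            (previous, current) = (current, previous);
        }
        return previous[b.Length];
    }

    // index maps file path -> Base64 string (an assumed in-memory form of the XML).
    // Returns the best-matching path, or null when the index is empty.
    public static string FindBestMatch(
        string embedded, IReadOnlyDictionary<string, string> index, int tolerance)
    {
        // Step 2: look for an exact match first.
        foreach (var entry in index)
            if (entry.Value == embedded) return entry.Key;

        // Steps 3-5: otherwise keep the entry with the lowest distance.
        string bestPath = null;
        int bestDistance = int.MaxValue;
        foreach (var entry in index)
        {
            int distance = Levenshtein(embedded, entry.Value);
            if (distance < bestDistance)
            {
                bestDistance = distance;
                bestPath = entry.Key;
            }
            // Step 6: stop early once a match is within tolerance.
            if (bestDistance <= tolerance) break;
        }
        return bestPath;
    }
}
```

Keeping only two rows of the distance table sidesteps allocating one cell per pair of characters, which is the allocation most likely to exhaust memory on long Base64 strings.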

Since this is a one-off task, I don't need instant performance. So, I could run this tonight before leaving the office and, hopefully, come back tomorrow to a list of paths connecting the original images to the ones embedded in the docs.

UPDATE

The heuristic above worked beautifully! I ended up using the Sift library to efficiently calculate distances between Base64 strings; specifically, I used its FastDistance() method. I am getting 100% accuracy in finding the images I need, even when the angle from which the photo was taken is slightly different.

Matt Cashatt
  • Seems reasonable. Note that it is probably memory-bound. You might need to split it up into batches somehow. – Chris Shain Jan 06 '12 at 16:37
  • Good point. At most I have 60 images per doc, so I will just do it on a per doc basis. Testing the theory now. Thanks! – Matt Cashatt Jan 06 '12 at 16:40
  • What happens if the same image appears in multiple documents? – Chris Shain Jan 06 '12 at 16:43
  • Thanks Chris. It would be okay if there is a common image in multiple docs because I am simply returning a file path. Ultimately, I will send the local images to a CDN and then swap the local file paths out with a CDN path. Since, at that point, I will have an association between embedded images and original image file path, I can swap that out with the respective CDN address as well. Thanks! – Matt Cashatt Jan 06 '12 at 16:53
  • @ChrisShain - Ah crud. Hitting a System.OutOfMemoryException when I try to put the length of each Base64 string into int[length1,length2]. So, looks like you may be right about the memory issue. Still digging... – Matt Cashatt Jan 06 '12 at 17:12
  • How about storing the Base64 version of the image on disk, and only keeping an MD5 hash of the Base64 value in memory? A perfect match on the hash would be a trigger to go and retrieve the file from disk for comparison on the full Base64. – Chris Shain Jan 06 '12 at 17:14
  • @ChrisShain- Thanks. Will look into this after a barrage of meetings. Cheers and THANK YOU! – Matt Cashatt Jan 06 '12 at 17:32
0

There is no built-in algorithm in the .NET framework for generating image similarity. You'd need to use a third-party library or do it yourself. Lots of image similarity algo questions on SO:

Algorithm for finding similar images

How can I measure the similarity between two images?

comparing images programmatically - lib or class

One more, for .NET: "Are there any OK image recognition libraries for .NET?" This one refers you to AForge, which seems to have the algorithm that you are after.

Chris Shain
  • I'll add another similar question: http://stackoverflow.com/questions/225210/removing-duplicate-images – hometoast Jan 06 '12 at 15:32
  • Thanks to you both, but none of these links really help. That may be the reason why the question keeps surfacing in various manifestations on SO. I don't care whether this happens in native .NET; a library is just fine. And I probably can't pursue coding a library myself at the moment, so all of the links above that discuss algorithmic theory aren't helpful, particularly the ones that aren't even covering .NET. Thanks anyway though. – Matt Cashatt Jan 06 '12 at 15:48
  • See my edit for another one you might be able to use more easily. – Chris Shain Jan 06 '12 at 15:50
  • Ah yes, I forgot about AForge, thank you. I stumbled on that this past summer and the library has tons of potential for various things. I just had an idea while pouring a cup of coffee. I am going to try it and see if it works before jumping into AForge. Meanwhile upvote for AForge. Cheers! – Matt Cashatt Jan 06 '12 at 15:58
  • Well, up vote in a few hours when I have a new daily quota of votes ;). – Matt Cashatt Jan 06 '12 at 15:59
  • Chris, I just proposed an answer. I would really appreciate your input if you have time. Thanks! – Matt Cashatt Jan 06 '12 at 16:29
0

According to this SO answer to a similar question, you should look at OpenCV and VLFeat. The former has a C++ API and the latter a C API, so you would need to write your own P/Invoke wrapper or perhaps wrap them in a C++/CLI facade, which you could call from C#.

dgvid
  • Thanks dgvid! I will look into those. In the meantime, I am going to propose an answer I just came up with that seems (to me) to be a bit more elegant. – Matt Cashatt Jan 06 '12 at 16:13