3

I would like to create an application which searches for similar documents in its database; e.g. the user uploads a document (text, image, etc.), and I would like to query my application for similar ones.

I have already created the necessary algorithms for the process (fingerprinting, feature extraction, hashing, hash comparison, etc.). I'm looking for a framework that ties all of these together.
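For readers unfamiliar with the fingerprinting and hash-comparison steps, here is a minimal sketch in Python. It assumes a SimHash-style fingerprint compared by Hamming distance; that is only one common choice and not necessarily what the asker implemented. The names `simhash` and `hamming_distance` are hypothetical.

```python
import hashlib


def simhash(features, bits=64):
    """Fold per-feature hashes into one fingerprint (SimHash-style).

    Assumption: `bits` is a multiple of 8 and at most 128 (MD5 digest size).
    """
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as candidates for "similar" when the Hamming distance between their fingerprints falls below a threshold chosen for the data set.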

For example, if I would implement it in Lucene, I would do the following:

  • Create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
  • Then add the created elements to the Lucene index
  • And finally use the MoreLikeThis class to find the similar documents

So basically Lucene might be a good choice, but as far as I know Lucene is not meant to be a document-similarity search engine, but rather a term-based search engine.
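The three steps listed above (extract features, index them, query for similar documents) can be sketched without any framework. The following is a plain-Python stand-in, not Lucene code: `extract_features`, `FeatureIndex`, and `more_like_this` are hypothetical names, word shingles stand in for the custom tokenizer, and Jaccard overlap stands in for whatever scoring a real engine would use.

```python
from collections import Counter, defaultdict


def extract_features(text, k=3):
    """Stand-in for the custom 'tokenizer': word k-shingles used as index terms."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


class FeatureIndex:
    """Tiny inverted index mapping each feature to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_features = {}

    def add(self, doc_id, text):
        features = set(extract_features(text))
        self.doc_features[doc_id] = features
        for feature in features:
            self.postings[feature].add(doc_id)

    def more_like_this(self, text, top_n=5):
        """Rank indexed documents by Jaccard overlap with the query's features."""
        query = set(extract_features(text))
        shared = Counter()
        for feature in query:
            for doc_id in self.postings.get(feature, ()):
                shared[doc_id] += 1
        scored = []
        for doc_id, n_shared in shared.items():
            union = len(query) + len(self.doc_features[doc_id]) - n_shared
            scored.append((n_shared / union, doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [(doc_id, score) for score, doc_id in scored[:top_n]]


index = FeatureIndex()
index.add("a", "the quick brown fox jumps over the lazy dog")
index.add("b", "an entirely different sentence about databases and indexes")
results = index.more_like_this("the quick brown fox leaps over the lazy dog")
```

Only documents sharing at least one feature with the query are scored, which is the same term-based lookup the question worries about; the similarity behaviour comes entirely from how the features are chosen.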

My question is: are there any applications/frameworks that might fit the problem described above?

Thanks, krisy

UPDATE: It seems that the process I described above is called Content-Based Media (Sound, Image, Video) Retrieval.

There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/PoweredBy (Lire, Alike, etc.), but I still haven't found any dedicated framework ...

krisy
  • Have a look at [answers of this question](http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene), I think it addresses the same topic. – mindas May 03 '13 at 09:20
  • Thanks; it confirms that my original idea can be done in Lucene! :-) But are there any other frameworks specially designed for this task? – krisy May 03 '13 at 09:26
  • I've heard about [gensim](http://radimrehurek.com/gensim/) but this is for Python. Not sure if there's anything similar for Java. – mindas May 03 '13 at 09:30
  • Looks great; I'm looking for something similar, yes! – krisy May 03 '13 at 09:31

2 Answers

0

If I understand correctly, you have your own database, and when a user uploads a document you want to check whether a duplicate or a similar copy already exists in it.

If that is the case, the domain is very broad.

1) For images you need pattern matching. There are a few papers on duplicate image detection available online; searching for them will give you many options.

2) For documents, there is a further division by format:

  1. DOC(x)
  2. PDF
  3. TXT
  4. RTF, etc.

Each format has different properties. Lucene may help you here, but it is a search engine.

When searching for language patterns there are many things to check, since you are looking for similar (not exactly identical) documents.

So a fuzzy-matching program will come in handy.
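As a rough illustration of fuzzy (not exact) text matching, here is a minimal sketch using Python's standard `difflib.SequenceMatcher`. The helper names `fuzzy_ratio` and `find_similar` and the 0.8 threshold are assumptions for the example, and comparing the query against every stored text this way would not scale to a large database.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a, b):
    """Similarity in [0, 1]; 1.0 means the normalized texts are identical."""
    # Normalize case and whitespace so trivial differences do not count.
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def find_similar(query, documents, threshold=0.8):
    """Return (doc_id, ratio) pairs at or above the threshold, best first."""
    hits = [(doc_id, fuzzy_ratio(query, text)) for doc_id, text in documents.items()]
    hits = [hit for hit in hits if hit[1] >= threshold]
    return sorted(hits, key=lambda hit: -hit[1])
```

Here `documents` is assumed to be a dictionary mapping a document id to its already extracted plain text, so format-specific parsing (DOC, PDF, RTF) has to happen beforehand.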

This requirement is too large for a forum page to explain everything, but I hope this much helps.

MarmiK
  • I do realize it is a huge domain - right now I'm only interested in finding the best tool for the job. About the size of the forum page; this reminds me: "I have discovered a truly marvelous proof of this, which this margin is too narrow to contain" :-) – krisy May 14 '13 at 09:49
  • I have seen Lucene but don't know much about it, so I can't say. But I know that you can find duplicates using neural language or fuzzy logic: with neural language you create a pattern, and with fuzzy logic you analyse and match them. SPSS is a statistical tool and, if I remember correctly, works only with files containing text. Its demo is free. – MarmiK May 14 '13 at 09:52
  • Perhaps this link on matching two data sets with SPSS is relevant: `http://www.ats.ucla.edu/stat/spss/faq/update.htm` :) – MarmiK May 14 '13 at 09:57
  • SPSS is a very cool piece of software - but not a framework for my needs :-( – krisy May 14 '13 at 14:27
0

Since you're using Lucene, you might take a look at Solr. I do realize it's not a dedicated framework for your purpose either, but it does add features on top of Lucene that come in quite handy. Given the pluggability of Lucene, its track record, and the fact that there are a lot of useful resources out there, Solr might help you get your job done.

Also, the answer that @mindas pointed to links to a blog post describing the technical details of how to accomplish your goal with Solr (but you have probably already read that in the meantime).
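The question linked in the comments is about cosine similarity between two documents. A minimal sketch of that measure in plain Python follows, assuming simple whitespace tokenization and raw term counts; this is not Solr or Lucene API, and `term_vector` and `cosine_similarity` are hypothetical helper names.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts; a real setup would use its own feature extraction."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors, in [0, 1]."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is treated as similar to nothing.
    return dot / (norm_a * norm_b)
```

The same function works on any sparse feature vector, so custom fingerprints or extracted features could be counted in place of words.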

Community
Grimace of Despair
  • Solr looks nice - will take a closer look! The blog post - yes, I read that - looks great; my own idea is somewhat similar. thanks! – krisy May 14 '13 at 09:45