
I need to build a search over people's names. I have already seen the great content here, but I need something different.

Here's my requirement.

I tried a phonetic search, but the names I need to index are not English names. I believe the phonetic algorithms implemented by Apache Solr / Lucene are not valid for Portuguese words (my culture).

After that, I decided to search using n-grams. It seems to work, but I need some way to measure how closely what the user typed matches what the Solr index holds. I cannot use the score, because it depends on how often each word occurs across all documents. So I need a number (a percentage, for example) as the result of the comparison: in other words, how closely what the user typed resembles the real name I have in Solr.

PS: My application will use this result to decide whether to keep what the user typed or continue with what exists in Solr.

Sample:

ID    NAME
1     James Bond
2     James Bond Junior
3     Tony Mellord

The user could search for Jhames Bond and, using n-grams, both 1 and 2 will match.

PS: I used English names just to clarify the scenario.

Is there any way to answer the question "how closely does what the user typed resemble what I have indexed?" without using the score? Let's say:

Jhames Bond matches James Bond at 97% (for example)
Jhames Bond matches James Bond Junior at 87%
Thiago Custodio
  • You have a couple of questions in that post, right? One about whether you can get a percentage of similarity, and a second about your field approach? – Mysterion Apr 02 '14 at 06:59
  • @Mysterion Yeah. But the main question is about the percentage. – Thiago Custodio Apr 02 '14 at 12:39
  • All I can do is recommend this article: http://wiki.apache.org/lucene-java/ScoresAsPercentages. I'm not sure if this is what you are looking for. – Mysterion Apr 02 '14 at 12:44
  • @Mysterion I'm not sure it is what I'm looking for. I need Solr / Lucene to somehow tell me how close what the user typed is to what I have indexed. As I said, the score cannot be used, since its formula varies with how many documents contain some of the words I'm searching for. Basically I need an output from an algorithm like dismax or n-gram, but expressed as something like a percentage. – Thiago Custodio Apr 02 '14 at 13:04
  • Oh, I finally get what you are asking for. It's probably a good idea to update your question with that. – Mysterion Apr 02 '14 at 13:22
  • @Mysterion Done. Any clues? – Thiago Custodio Apr 02 '14 at 13:32

1 Answer


If you are happy with how you are querying, and just want to come up with the percentage, you could compare the query value with the value returned from the index, as a postprocessing step, using a Levenshtein distance.

There is an implementation of the Levenshtein distance algorithm in Apache Commons Lang: StringUtils.getLevenshteinDistance

The maximum possible distance is the length of the longer of the two strings compared, so computing a similarity ratio between 0 and 1 (multiply by 100 for a percentage) might look something like:

1.0 - ((double) StringUtils.getLevenshteinDistance(str1, str2) / Math.max(str1.length(), str2.length()));
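To make the normalization concrete without pulling in a dependency, here is a minimal, self-contained sketch. It swaps in a hand-written textbook Levenshtein routine for the Commons call; the class name `NameSimilarity` and the methods `levenshtein` and `similarity` are hypothetical names chosen for illustration, and lower-casing both inputs before comparing is an assumption about what suits name matching, not something the approach requires.

```java
import java.util.Locale;

// Hypothetical helper, not part of Solr or Apache Commons.
final class NameSimilarity {

    // Textbook dynamic-programming Levenshtein distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 means identical, 0.0 means nothing in common.
    // Case folding is an assumption made for this sketch.
    static double similarity(String typed, String indexed) {
        String a = typed.toLowerCase(Locale.ROOT);
        String b = indexed.toLowerCase(Locale.ROOT);
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0; // two empty strings are treated as identical
        }
        // Cast to double so the division is not truncated to an integer.
        return 1.0 - (double) levenshtein(a, b) / longest;
    }

    public static void main(String[] args) {
        String typed = "Jhames Bond";
        String[] indexed = {"James Bond", "James Bond Junior", "Tony Mellord"};
        for (String name : indexed) {
            System.out.printf(Locale.ROOT, "%s vs %s: %.0f%%%n",
                    typed, name, 100 * similarity(typed, name));
        }
    }
}
```

With this normalization, "Jhames Bond" against "James Bond" is one edit over 11 characters, or roughly 91%. Extra trailing words such as "Junior" are penalized heavily, because every added character counts as an edit.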

Jaro-Winkler distance (StringUtils.getJaroWinklerDistance) might be a better algorithm to use, and a bit simpler, since its result is already normalized and can be presented directly as a percentage. It also seems to come out closer to the example values you have provided.
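For readers who want to see what the Jaro-Winkler measure computes, the following is a sketch of the textbook formulation (prefix scale 0.1, common prefix capped at 4 characters). It is not the Commons source, and its results may differ slightly from `StringUtils.getJaroWinklerDistance` depending on the library version; the class name `JaroWinklerSketch` and its method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical illustration of textbook Jaro-Winkler, not the Commons code.
final class JaroWinklerSketch {

    // Jaro similarity in [0, 1].
    static double jaro(String s1, String s2) {
        if (s1.equals(s2)) {
            return 1.0;
        }
        int len1 = s1.length();
        int len2 = s2.length();
        if (len1 == 0 || len2 == 0) {
            return 0.0;
        }
        // Characters only count as matching within this sliding window.
        int window = Math.max(0, Math.max(len1, len2) / 2 - 1);
        boolean[] matched1 = new boolean[len1];
        boolean[] matched2 = new boolean[len2];
        int matches = 0;
        for (int i = 0; i < len1; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(len2 - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) {
            return 0.0;
        }
        // Count matched characters that appear in a different order.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < len1; i++) {
            if (!matched1[i]) {
                continue;
            }
            while (!matched2[k]) {
                k++;
            }
            if (s1.charAt(i) != s2.charAt(k)) {
                outOfOrder++;
            }
            k++;
        }
        double m = matches;
        double transpositions = outOfOrder / 2.0;
        return (m / len1 + m / len2 + (m - transpositions) / m) / 3.0;
    }

    // Winkler adjustment: boost the score for a shared prefix of up to 4 characters.
    static double jaroWinkler(String s1, String s2) {
        double j = jaro(s1, s2);
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        int prefix = 0;
        while (prefix < limit && s1.charAt(prefix) == s2.charAt(prefix)) {
            prefix++;
        }
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args) {
        String typed = "jhames bond";
        for (String name : new String[] {"james bond", "james bond junior"}) {
            System.out.printf(Locale.ROOT, "%s vs %s: %.1f%%%n",
                    typed, name, 100 * jaroWinkler(typed, name));
        }
    }
}
```

Because the prefix bonus rewards strings that start the same way, this measure tends to be more forgiving than normalized Levenshtein when the indexed name merely has extra trailing words.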

femtoRgon
  • A few days ago, I asked the question linked below, and I found the spellcheck accuracy. Do you know whether spellcheck could be used to search names and/or fit my requirement? Honestly, I don't know if I should use spellcheck or not. Sorry to bother you again. http://stackoverflow.com/questions/22506782/difference-between-spellcheck-and-phonetic-search-apache-solr – Thiago Custodio Apr 03 '14 at 02:30
  • Possibly, but spell check relies on having a dictionary. If your index can act as a dictionary of names for you (or you have some other dictionary of standard names) then it could be quite useful for this, yes. Since spell check is usually based on the popularity of terms in the corpus, and names tend to be less frequent terms, it would just depend on the makeup of your index. – femtoRgon Apr 03 '14 at 03:36