1

I am developing webservice in Java using REST framework.

I am using MySQL 5.1 database as a backend.

I am performing search operation on one of my table say Stops using like pattern.

But now I want to perform "Approximate_string_matching (fuzzy string searching)" for above search. Consider an e.g. for 23 ST stop, user can provide search string 23rd station, 23rd, 23 station, 23rd ST etc.

For this Approximate_string_matching algorithm I found the link http://en.wikipedia.org/wiki/Approximate_string_matching

But I dont know how to implement it.

Please guys help me to implement Approximate_string_matching algorithm in Java / MySQL?

Thank you in advance.

Deepu
  • 2,590
  • 19
  • 52
  • 74

2 Answers2

5

One thing you might want to look into would be the Levenshtein Distance Algorithm:

Levenshtein distance is a string metric for measuring the difference between two sequences.

The Apache Commons Lang has an implementation of this readily available. You could use the getLevenshteinDistance(CharSequence s, CharSequence t, int threshold) to get the strings which are approximately equal to the given string. The threshold would come in handy so that you would be able to discard words which are a certain distance away from your source word, thus avoiding unneeded computation.

A better approach would be to use the Levenshtein function provided by MySQL iteself. A simple example of how to execute can be seen here.

Community
  • 1
  • 1
npinti
  • 51,780
  • 5
  • 72
  • 96
  • Thank you npinti for quick reply. I'll study about this and get back to you. – Deepu Oct 23 '12 at 05:48
  • Hi npinti, how can I implement Levenshtein Distance Algorithm in java? – Deepu Oct 23 '12 at 06:17
  • @Deepu: The Apache Commons already provides an implementation. If you want to get your hands dirty, the Wiki Page I have linked has a generic algorithm for the Levenshtein algorithm, so you can take a look at that. Please note that not doing this on your database will have some performance drawbacks since you will need to get all the data from the Database and then process it in your Java code. – npinti Oct 23 '12 at 06:20
1

As per your explanation it seems that whenever any user provides search string as 23rd station, 23rd, 23 station, or 23rd ST then the filtered output should be "23 ST stop", Right?

So I am assuming that all your stops names will be like XX YY stop where XX is a numerical value and YY is shortform for some station like ST, VT, MT etc

If that is right, then one way you may achieve this is by performing multiple filters such that output of first filter is input to the next filter. But before that you need to figure out "what to filter on"?

So in this particular case, it seems "23" is a must present substring in the start of the query string, so you need to extract the numeric part from your query string (you can use Java regex) apply the result as first filter, so in this case it will be:

 where stops like '23%'

then on output of this result you can apply next filter, and that next filter in this case could be the first two letter of the next word (if present) and applying its lower case for consistency, so in this case it would be 'st':

 where LOWER(stops) like '%st%'

Now you can achieve this in the query part itself by applying both the filters in the same query (try using subqueries) or you can bring in the resultset of first filter and apply the remaining filter on that resultset using Java regex.

sactiw
  • 21,935
  • 4
  • 41
  • 28
  • thanks sactiw for your explanations. You pointed the right circumstances. I'll try it in my code – Deepu Oct 23 '12 at 07:12