3

Say I have a string

local a = "Hello universe"

I find the substring "universe" by

a:find("universe")

Now, suppose the string is

local a = "un#verse"

The string to be searched is universe; but the substring differs by a single character. So obviously Lua ignores it.

How do I make the function find the string even if there is a discrepancy by a single character?

SatheeshJM
  • 3,575
  • 8
  • 37
  • 60
  • Check out the [Levenshtein distance](http://stackoverflow.com/questions/5859561/getting-the-closest-string-match/10356754#10356754) algorithm. – Mud Oct 19 '12 at 14:48

4 Answers4

5

If you know where the character would be, use . instead of that character: a:find("un.verse")

However, it looks like you're looking for a fuzzy string search. It is out of a scope for a Lua string library. You may want to start with this article: http://ntz-develop.blogspot.com/2011/03/fuzzy-string-search.html

As for Lua fuzzy search implementations — I haven't used any, but googing "lua fuzzy search" gives a few results. Some are based on this paper: http://web.archive.org/web/20070518080535/http://www.heise.de/ct/english/97/04/386/

Try https://github.com/ajsher/luafuzzy.

Alexander Gladysh
  • 39,865
  • 32
  • 103
  • 160
  • Eep! I did not realize this was a complex problem... Certainly not worth implementing this for my current requirement which is very trivial.. Although it looks like a very interesting topic.. might be useful some day :) Thanks for the reply! – SatheeshJM Oct 19 '12 at 07:27
4

It sounds like you want something along the lines of TRE:

TRE is a lightweight, robust, and efficient POSIX compliant regexp matching library with some exciting features such as approximate (fuzzy) matching.

Approximate pattern matching allows matches to be approximate, that is, allows the matches to be close to the searched pattern under some measure of closeness. TRE uses the edit-distance measure (also known as the Levenshtein distance) where characters can be inserted, deleted, or substituted in the searched text in order to get an exact match. Each insertion, deletion, or substitution adds the distance, or cost, of the match. TRE can report the matches which have a cost lower than some given threshold value. TRE can also be used to search for matches with the lowest cost.

A Lua binding for it is available as part of lrexlib.

Community
  • 1
  • 1
furq
  • 5,648
  • 3
  • 16
  • 21
  • Neat little library.. Unfortunately I can't integrate C libraries in my program. So that's out of the window :( – SatheeshJM Oct 19 '12 at 10:21
2

A simple roll your own approach (based on the assumption that the pattern keeps the same length):

function hammingdistance(a,b)
    local ta={a:byte(1,-1)}
    local tb={b:byte(1,-1)}
    local res = 0
    for k=1,#a do
        if ta[k]~=tb[k] then
            res=res+1
        end
    end
    print(a,b,res) -- debugging/demonstration print
    return res
end

function fuz(s,pat)
    local best_match=10000
    local best_location
    for k=1,#s-#pat+1 do
        local cur_diff=hammingdistance(s:sub(k,k+#pat-1),pat)
        if  cur_diff < best_match then
            best_location = k
            best_match = cur_diff
        end
    end
    local start,ending = math.max(1,best_location),math.min(best_location+#pat-1,#s)
    return start,ending,s:sub(start,ending)
end

s=[[Hello, Universe! UnIvErSe]]
print(fuz(s,'universe'))

Disclaimer: not recommended, just for fun:

If you want a better syntax (and you don't mind messing with standard type's metatables) you could use this:

getmetatable('').__sub=hammingdistance
a='Hello'
b='hello'
print(a-b)

But note that a-b does not equal b-a this way.

jpjacobs
  • 9,359
  • 36
  • 45
  • Note that messing up with standard types usually considered to be a bad style. – Alexander Gladysh Oct 19 '12 at 09:56
  • 2
    But it's fun! Only a pity that you can't make overwrite standard operators as to make 1+1 == 3 ;). But is indeed bad style. Disclaimer added. – jpjacobs Oct 19 '12 at 10:25
  • @jpjacobs Thanks for the code. Although the performance will get a hit big time(quadratic time?), won't it? And I will be comparing quite a few words, so... – SatheeshJM Oct 19 '12 at 10:28
  • It all depends. If it's parsing user input, no problem, the user will be slower. Else, just give it a try, and measure the performance hit. – jpjacobs Oct 19 '12 at 11:03
2

If you are really looking for a single character difference and do not care about performance, here is a simple approach that should work:

local a = "Hello un#verse"

local myfind = function(s,p)  
  local withdot = function(n)
    return p:sub(1,n-1) .. '.' .. p:sub(n+1)
  end
  local a,b
  for i=1,#s do
    a,b = s:find(withdot(i))
    if a then return a,b end
  end
end

print(myfind(a,"universe"))
catwell
  • 6,770
  • 1
  • 23
  • 21