2

I want to make an algorithm which can see if an address is written in a sentence.

For instance, if a user writes:

"Hi, my address is Lincolnstreet 27, Foobarcity. Can you pick up the package there?"

And the user's address is Lincolnstreet 27, Foobarcity, then I want an algorithm that can detect that the address was mentioned in the sentence.

I already know the user's street name and number, zip code and city name.

It also needs to be fuzzy, in that people can make typos or make slight variations of their address that they wrote in the sentence. However, it's not required that the algorithm catches all occurences always no matter how mistyped they are, since that is obviously impossible. It's okay with a semi-naive solution.

I looked into Levehnstein distance, but I can't figure out how to make it work for this exact scenario. I also looked into Longest Common Subsequence, and it's the same problem there.

Any ideas? I don't necessarily care about the programming language.

I am not interested in a neural net solution - I truly believe it should be solvable with a relatively naive algorithm - I just don't know where to start.

Mathias Lykkegaard Lorenzen
  • 15,031
  • 23
  • 100
  • 187

3 Answers3

3

Taking the sentence as the larger string, you basically want to see the following:

  • street name is present
  • city name is present
  • street number is present

You can check for order if you care to, but you want it to be fuzzy, so we'll ignore that for now. It may be prudent to check for overlap, which you can do by looking at the start and end of the substrings and comparing them.

Your language of choice almost certainly has some sort of .contains() function, and it likely has a fuzzy mode.

In which case,

if (sentence.roughly_contains(streetname) and sentence.roughly_contains(cityname) and sentence.contains(streetnumber)) {
    return true;
}

if you can't find a fuzzy match function, write one! Fuzzy Text Matching C# provides us with https://blogs.msdn.microsoft.com/toub/2006/05/05/generic-levenshtein-edit-distance-with-c/ which gives us a good generic implementation of fuzzy search that you can use to make a .roughly_contains() function.

Order wise; the check roughly follows the pattern:

//where all string.[start|end] are integers, locations can be found trivially or with the help of google once you know their presence
overlap(string1, string2) {
    if (string1.start > string2.end || string1.end < string2.start) {
        return false;
    }
    else {
        return true
    }
}

(this is assuming that you know the addresses independently of the sentence)

jsarbour
  • 928
  • 1
  • 8
  • 20
  • How would I write the function `roughly_contains`? That's the tricky part of this. It's easy to fuzzy search for a word in a string, but to fuzzy search a sentence within a sentence is hard. – Mathias Lykkegaard Lorenzen Jun 07 '19 at 13:39
  • May you please clarify what your input is? Are you receiving paragraphs or individual sentences? – jsarbour Jun 07 '19 at 14:25
  • Your `roughly_contains` should take in two strings, a sentence (a long string) and a search term (streetname or cityname). It should then return a true or false value if it thinks that the search term is present within the sentence. You can set your "distance" within the fuzzy search implementation. – jsarbour Jun 07 '19 at 14:59
0

This is slightly more complicated than what you want, but the answers there could certainly help you: How to parse freeform street/postal address out of text, and into components

A very naive way to solve the problem, at least partially, would be to split both the sentences and the adress into words, and then, for each word on the address, check to which words on the string it is the most similar to. Then average the scores, and check if the average is above a threshold. Of course, this does not account for position, nor semantics.

Haroldo_OK
  • 6,612
  • 3
  • 43
  • 80
-1

I wonder if you couldn't just break it up into each sentence. Feed it into a search engine like Google, and see what kind of links it kicks back including suggested corrections / what search term it's actually showing results for. A bit heavy on the internet usage but I think it could work.