Algorithm to see if an address has been written in a sentence

Question

I want to make an algorithm which can see if an address is written in a sentence.

For instance, if a user writes:

"Hi, my address is Lincolnstreet 27, Foobarcity. Can you pick up the package there?"

And the user's address is Lincolnstreet 27, Foobarcity, then I want an algorithm that can detect that the address was mentioned in the sentence.

I already know the user's street name and number, zip code and city name.

It also needs to be fuzzy, in that people can make typos or make slight variations of their address that they wrote in the sentence. However, it's not required that the algorithm catches all occurences always no matter how mistyped they are, since that is obviously impossible. It's okay with a semi-naive solution.

I looked into Levehnstein distance, but I can't figure out how to make it work for this exact scenario. I also looked into Longest Common Subsequence, and it's the same problem there.

Any ideas? I don't necessarily care about the programming language.

I am not interested in a neural net solution - I truly believe it should be solvable with a relatively naive algorithm - I just don't know where to start.

C# - but the language doesn't matter. Pseudocode is also fine - I'll just convert it. Thanks for asking though! — Mathias Lykkegaard Lorenzen, Jun 06 '19 at 13:57
You will surely need a database to check the sentences against. No algorith can make a distinction between pure text and text that contains information about possible addresses. — Jan, Jun 06 '19 at 14:09
Do you know the addresses already? or are you trying to find when a user mentions their address without knowing the address? — jsarbour, Jun 06 '19 at 14:16
I already have the user's address. I want to know if the user mentions his own address in a string. — Mathias Lykkegaard Lorenzen, Jun 07 '19 at 13:38

jsarbour · Accepted Answer · 2019-06-06T14:11:25.950

Taking the sentence as the larger string, you basically want to see the following:

street name is present
city name is present
street number is present

You can check for order if you care to, but you want it to be fuzzy, so we'll ignore that for now. It may be prudent to check for overlap, which you can do by looking at the start and end of the substrings and comparing them.

Your language of choice almost certainly has some sort of .contains() function, and it likely has a fuzzy mode.

In which case,

if (sentence.roughly_contains(streetname) and sentence.roughly_contains(cityname) and sentence.contains(streetnumber)) {
    return true;
}

if you can't find a fuzzy match function, write one! Fuzzy Text Matching C# provides us with https://blogs.msdn.microsoft.com/toub/2006/05/05/generic-levenshtein-edit-distance-with-c/ which gives us a good generic implementation of fuzzy search that you can use to make a .roughly_contains() function.

Order wise; the check roughly follows the pattern:

//where all string.[start|end] are integers, locations can be found trivially or with the help of google once you know their presence
overlap(string1, string2) {
    if (string1.start > string2.end || string1.end < string2.start) {
        return false;
    }
    else {
        return true
    }
}

(this is assuming that you know the addresses independently of the sentence)

How would I write the function `roughly_contains`? That's the tricky part of this. It's easy to fuzzy search for a word in a string, but to fuzzy search a sentence within a sentence is hard. — Mathias Lykkegaard Lorenzen, Jun 07 '19 at 13:39
May you please clarify what your input is? Are you receiving paragraphs or individual sentences? — jsarbour, Jun 07 '19 at 14:25
Your `roughly_contains` should take in two strings, a sentence (a long string) and a search term (streetname or cityname). It should then return a true or false value if it thinks that the search term is present within the sentence. You can set your "distance" within the fuzzy search implementation. — jsarbour, Jun 07 '19 at 14:59

Haroldo_OK · Answer 2 · 2019-06-07T13:52:38.827

0

This is slightly more complicated than what you want, but the answers there could certainly help you: How to parse freeform street/postal address out of text, and into components

A very naive way to solve the problem, at least partially, would be to split both the sentences and the adress into words, and then, for each word on the address, check to which words on the string it is the most similar to. Then average the scores, and check if the average is above a threshold. Of course, this does not account for position, nor semantics.

edited Jun 07 '19 at 13:52

answered Jun 06 '19 at 14:18

Haroldo_OK

6,612
3
43
80

2

That's not what I am looking for. Please read the question carefully. – Mathias Lykkegaard Lorenzen Jun 07 '19 at 13:38
1

I see... I added a suggestion for a naive solution. – Haroldo_OK Jun 07 '19 at 14:02

score -1 · Answer 3 · answered Jun 06 '19 at 14:07

-1

I wonder if you couldn't just break it up into each sentence. Feed it into a search engine like Google, and see what kind of links it kicks back including suggested corrections / what search term it's actually showing results for. A bit heavy on the internet usage but I think it could work.

answered Jun 06 '19 at 14:07

user11609519

11

That is total over-engineering in my opinion, but thanks for providing a suggestion either way. – Mathias Lykkegaard Lorenzen Jun 06 '19 at 14:42
That's actually the best way to do it. Use Algolia ! – Fattie Jun 07 '19 at 13:52
How is this even close to being the best way of doing it when I already know the address. There has to be a better way. – Mathias Lykkegaard Lorenzen Jun 09 '19 at 13:31

Algorithm to see if an address has been written in a sentence

3 Answers3