I have a dataset of many UK addresses that I need to parse, and for each address I need to output different (acceptable) variants of that address.

First, I'm trying to find out whether the problem can be reduced to something simpler (perhaps with an existing library). If no such library exists, I'm looking for a way to combine Python/R functions to parse each input address and produce the acceptable outputs.

For example:

(Actual address) Flat 24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP

Acceptable variants that the Python/R code should output:

*Flat24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
*F24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
*f24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
*f24 a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP

The following are also acceptable variant outputs (with the postcode written without a space):

*Flat24a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
*F24a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
*f24a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
*f24 a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
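The eight variants listed above could be generated along these lines. This is only a minimal sketch: the names `flat_variants`, `postcode_variants` and `address_variants` are made up for illustration, the postcode pattern is a deliberate simplification rather than a full UK postcode validator, and it assumes the address starts with "Flat <number><optional letter>".

```python
import re

# Assumption: the address starts with "Flat", a number and an optional letter.
FLAT_RE = re.compile(r"^Flat\s+(\d+)([A-Za-z]?)\s+", re.IGNORECASE)
# Assumption: simplified postcode shape (outward code + inward code) at the end.
POSTCODE_RE = re.compile(r"([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})$", re.IGNORECASE)


def flat_variants(address):
    """Return the address with the flat designator written in compact forms."""
    m = FLAT_RE.match(address)
    if not m:
        return [address]
    number, letter = m.groups()
    rest = address[m.end():]
    prefixes = [f"Flat{number}{letter}", f"F{number}{letter}", f"f{number}{letter}"]
    if letter:
        # "f24 a": a space between the number and its letter suffix.
        prefixes.append(f"f{number} {letter}")
    return [f"{p} {rest}" for p in prefixes]


def postcode_variants(address):
    """Return the address with the postcode spaced and unspaced."""
    m = POSTCODE_RE.search(address)
    if not m:
        return [address]
    head = address[:m.start()]
    outward, inward = m.groups()
    return [f"{head}{outward} {inward}", f"{head}{outward}{inward}"]


def address_variants(address):
    """Combine every postcode form with every flat designator form."""
    return [v for pc in postcode_variants(address) for v in flat_variants(pc)]


for variant in address_variants(
    "Flat 24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP"
):
    print(variant)
```

Addresses that do not start with "Flat" pass through `flat_variants` unchanged, so only the postcode spacing varies for them.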

Some other forms of address that need to be parsed, allowing different variants to be identified:

161-163 Newhaven Road, Edinburgh EH6 4QA
49a Torphin Road, Edinburgh EH13 0PQ
23/27 Gylemuir Road, Edinburgh EH12 7UB
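To handle these shapes, the first line could be split into its parts before any variants are produced. The sketch below assumes only the forms seen in the examples so far (a plain number, a number with a letter suffix, a range with "-" or "/", and an optional leading "Flat"); `parse_first_line` and the group names are hypothetical.

```python
import re

# Assumed shapes, taken from the examples: "161-163", "49a", "23/27", "Flat 24a".
LEADING_RE = re.compile(
    r"^(?:(?P<unit>flat)\s*)?"                  # optional "Flat" keyword
    r"(?P<first>\d+)(?P<suffix>[a-z]?)"         # number with optional letter
    r"(?:\s*(?P<sep>[-/])\s*(?P<second>\d+))?"  # optional range: "-163" or "/27"
    r"\s+(?P<street>.+)$",                      # remainder is the street name
    re.IGNORECASE,
)


def parse_first_line(line):
    """Return a dict of the parts of the first address line, or None."""
    m = LEADING_RE.match(line.strip())
    return m.groupdict() if m else None


for address in (
    "161-163 Newhaven Road, Edinburgh EH6 4QA",
    "49a Torphin Road, Edinburgh EH13 0PQ",
    "23/27 Gylemuir Road, Edinburgh EH12 7UB",
):
    first_line = address.split(",")[0]
    print(parse_first_line(first_line))
```

Once the parts are separated, each one (the number, the separator, the street) can be varied on its own and the pieces joined back together.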

The code should output several variants (perhaps five or six) for every address that is parsed. The focus should be on the first two lines of the address, because this is where people usually shorten words or simplify the address. The remaining lines, which give the city/town/county/country, may only need to be parsed to provide lowercase possibilities.

The structure and form of the address matter because the code should output "f24": people may write f24 for Flat 24, possibly with spaces in between. Is this possible in Python/R using regular expressions, and does anyone have a sample they have worked on before?

#

UPDATE: One simpler form of the use case I could think of is a rule-based parser. For example, every UK address is structured so that a comma separates each line of the address. Hence, a rule can be applied to the text up to the first comma, after which the next rule processes the next line up to the following comma, and so on.

Flat 24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP

Rule 1 = *Flat 24a Ardshiel Avenue*
The acceptable variant outputs that the parser should provide are:

1) Flat 24a Ardshiel Avenue (The actual line itself)
2) Flat 24a, Ardshiel Avenue (With a comma)
3) Flat24a Ardshiel Avenue
4) F24a Ardshiel Avenue
5) f24a Ardshiel Avenue
6) f24 Ardshiel Avenue
7) Flat24a Ardshiel Ave
8) F24a Ardshiel Ave
9) f24a Ardshiel Ave
10) f24 Ardshiel Ave
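A possible sketch of Rule 1 that reproduces the ten outputs above, in the same order. The function names and the abbreviation table are assumptions for illustration; the table only contains the two abbreviations mentioned in this thread ("Ave" and "St") and would need extending for real data.

```python
import re

# Assumed abbreviation table; extend as needed for real data.
STREET_ABBREVIATIONS = {"Avenue": "Ave", "Street": "St"}

# Assumption: the line looks like "Flat <number><optional letter> <street>".
FIRST_LINE_RE = re.compile(r"^Flat\s+(\d+)([A-Za-z]?)\s+(.+)$")


def street_forms(street):
    """Return the street as written, plus an abbreviated form if one is known."""
    forms = [street]
    words = street.split()
    if words and words[-1] in STREET_ABBREVIATIONS:
        forms.append(" ".join(words[:-1] + [STREET_ABBREVIATIONS[words[-1]]]))
    return forms


def rule1_variants(line):
    """Generate the acceptable variants of a 'Flat ...' first address line."""
    m = FIRST_LINE_RE.match(line)
    if not m:
        return [line]
    number, letter, street = m.groups()
    unit = number + letter
    # The line itself, then the form with a comma after the flat number.
    variants = [line, f"Flat {unit}, {street}"]
    # Compact flat designators; the last one drops the letter suffix.
    compact = [f"Flat{unit}", f"F{unit}", f"f{unit}", f"f{number}"]
    for form in street_forms(street):
        variants.extend(f"{c} {form}" for c in compact)
    # Remove duplicates (e.g. when there is no letter) while keeping order.
    return list(dict.fromkeys(variants))


for i, variant in enumerate(rule1_variants("Flat 24a Ardshiel Avenue"), start=1):
    print(f"{i}) {variant}")
```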


Rule 2 = *Drum Brae*
The acceptable variant outputs that the parser should provide are as follows.
Since not many variants can be produced from these two separate words, maybe
the acceptable variants could be:

1) Drum Brae (The actual line itself)
2) DrumBrae (Assuming that people can still denote Street names in this way)


Rule 3 = *Edinburgh EH4 7HP*
The acceptable variant outputs that the parser should provide are:

1) Edinburgh EH4 7HP (The actual line itself)
2) Edinburgh EH47HP

At the end, the output pieces should be joined back together to form the full address.

I am trying to see if there's a library that could be leveraged, or if someone could help me with a rule-based parser or regular expressions to solve the above.

UPDATE 2: Would writing many IF-ELSE statements help solve this problem? If so, could anyone share similar code samples that I could start from?
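The split-on-commas idea, with one rule per line and the pieces joined back together, might look like the sketch below. It uses `itertools.product` to form every combination of per-line variants. The rule functions are simplified stand-ins (the first line is passed through unchanged here), and the sketch assumes one rule is supplied per comma-separated line.

```python
import re
from itertools import product


def unchanged(line):
    """Placeholder rule: return the line as-is (stand-in for Rule 1)."""
    return [line]


def joined_words_variants(line):
    """Rule 2 stand-in: the line itself and the line with spaces removed."""
    return list(dict.fromkeys([line, line.replace(" ", "")]))


def postcode_line_variants(line):
    """Rule 3 stand-in: the line itself and the line with an unspaced postcode.

    Assumes the line ends with a spaced postcode such as "EH4 7HP".
    """
    m = re.search(r"(\S+)\s+(\d[A-Za-z]{2})$", line)
    if not m:
        return [line]
    return [line, line[:m.start()] + m.group(1) + m.group(2)]


def combine(address, rules):
    """Split on commas, apply one rule per line, and join every combination."""
    parts = [p.strip() for p in address.split(",")]
    per_line = [rule(part) for rule, part in zip(rules, parts)]
    return [", ".join(combo) for combo in product(*per_line)]


rules = [unchanged, joined_words_variants, postcode_line_variants]
for variant in combine(
    "Flat 24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP", rules
):
    print(variant)
```

The number of outputs is the product of the per-line variant counts, so it grows quickly once each rule produces more than a handful of forms.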
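As one alternative to long IF-ELSE chains, the rules could be kept in a table of regular-expression substitutions and applied in a loop, so adding a rule means adding a row rather than another branch. The table below is hypothetical and only covers the shortenings mentioned in this thread; `apply_rules` is an illustrative name.

```python
import re

# Hypothetical rule table: (description, pattern, replacement).
RULES = [
    ("drop the space after Flat", r"^Flat\s+(\d+)", r"Flat\1"),
    ("abbreviate Flat to f", r"^Flat\s*(\d+)", r"f\1"),
    ("abbreviate Street", r"\bStreet\b", "St"),
    ("abbreviate Avenue", r"\bAvenue\b", "Ave"),
]


def apply_rules(line):
    """Apply each rule to the original line and collect the distinct results."""
    variants = [line]
    for _description, pattern, replacement in RULES:
        candidate = re.sub(pattern, replacement, line)
        # Keep only rules that actually changed something new.
        if candidate != line and candidate not in variants:
            variants.append(candidate)
    return variants


print(apply_rules("Flat 8 Queen Street"))
```

Each rule here is applied to the original line on its own; combining several rules in one variant (for example "f8 Queen St") would need the outputs of one rule fed into the next.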

Dinesh
  • I don't quite understand the meaning of different output variants and how it should be implemented. I can understand the problem, that there are different variants of input, which a function has to be capable to deal with. But different outputs? How should the function decide which type of output to return in what situation...? – SpghttCd Mar 30 '19 at 20:22
  • Relevant links: https://stackoverflow.com/questions/11456670/regular-expression-for-address-field-validation, https://stackoverflow.com/questions/9397485/regex-street-address-match – divibisan Mar 30 '19 at 20:37
  • @SpghttCd, what I meant by different output variants is this: assume there is one specific "correct" way that a UK address is written (I would refer to how postal addresses are printed on letters, for example), but many people write addresses in multiple different ways. They would write "f8" (instead of Flat 8) and they could write "Queen St" instead of "Queen Street" in the second line of the address. If a system stores many of these different variants, it becomes very cumbersome to check for matching addresses, as they are stored in many different forms. – Dinesh Mar 30 '19 at 22:35
  • The code will be implemented into an already existing process automation - (which will automate the reading of all the addresses and the Python/R code should produce acceptable output variants to the Input address). Input addresses > Python/R script or code > Output variants of the addresses – Dinesh Mar 30 '19 at 22:44
  • Here's a regex for validating just the postcode: https://stackoverflow.com/questions/164979/uk-postcode-regex-comprehensive – divibisan Apr 01 '19 at 16:30
  • The general consensus on this seems to be that natural language parsing of addresses is too big of a problem for regex and that if you want a solution that won't just create more work and problems for yourself (troubleshooting and fixing mis-parsed addresses after the fact), you should find a pre-existing address validation/verification system (something like this https://www.easypost.com/uk-address-verification) and use that. Unfortunately, software recommendation questions are off-topic here, so we can't help with that – divibisan Apr 01 '19 at 16:34

1 Answer

Recognizing postal addresses is a hard problem, as an address is whatever a human is expected to understand in a context.

Do you understand this address from Romania? bl. l1 scara 3. aleea putna manastur cluj?

Generating variants is the same as recognizing in the general case.

That said, I'd choose an NLP, data-based solution like this one for the general case.

Do you have a simpler use case?

Mihai Andrei
  • The only simplification I have is that they are all UK addresses and will follow the format of a UK-style postal address. Unfortunately, I don't have a simpler use case than the one I've described. This is understandably an immense challenge for me at the moment. – Dinesh Mar 30 '19 at 22:39
  • The libpostal solution parses an input address into the different categories found within the address. I need to take an address and produce different variants of that address. The focus would be more on the first two lines of the address, because these are where people shorten words like "Flat" to "f" and "Street" to "St", and sometimes add an unnecessary space between the type of accommodation and the door number. Hence, the output variants need to show all these possibilities. – Dinesh Mar 30 '19 at 22:51