I have a dataset of many UK addresses that I need to parse. For each address, I want to output different (acceptable) variants of that address.
First, I'd like to know whether the problem can be reduced to something simpler, perhaps with an existing library. If no such library exists, I'm looking for a way to combine Python/R functions to parse each input address and produce the acceptable outputs.
For example:
(Actual address) Flat 24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
Acceptable variants that the Python/R code should output:
*Flat24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
*F24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
*f24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
*f24 a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
The following are also acceptable variant outputs (with the postcode written without a space):
*Flat24a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
*F24a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
*f24a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
*f24 a Ardshiel Avenue, Drum Brae, Edinburgh EH47HP
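The postcode spacing variants above could be produced with a single regular expression. This is only a minimal sketch: the pattern is a simplified approximation of UK postcodes (an assumption, not the full official format), and `postcode_variants` is a hypothetical helper name.

```python
import re

# Simplified UK postcode pattern (an assumption, not the complete official rules):
# outward code such as "EH4" or "EH13", optional whitespace, inward code such as "7HP".
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b", re.IGNORECASE)

def postcode_variants(line):
    """Return the line with its postcode written with and without a space."""
    match = POSTCODE_RE.search(line)
    if match is None:
        return [line]  # no postcode found: leave the line untouched
    outward, inward = match.group(1), match.group(2)
    start, end = match.span()
    spaced = line[:start] + outward + " " + inward + line[end:]
    joined = line[:start] + outward + inward + line[end:]
    return [spaced, joined]

print(postcode_variants("Edinburgh EH4 7HP"))
# ['Edinburgh EH4 7HP', 'Edinburgh EH47HP']
```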
Some other address forms that need to be parsed so that different variants can be identified:
161-163 Newhaven Road, Edinburgh EH6 4QA
49a Torphin Road, Edinburgh EH13 0PQ
23/27 Gylemuir Road, Edinburgh EH12 7UB
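To handle these forms, the leading house number could first be split from the street name. The sketch below assumes the number is always at the start of the line and is either a single number, a number with a letter suffix, or two numbers joined by "-" or "/". The name `split_house_number` and the returned dictionary keys are hypothetical.

```python
import re

# Leading number, optional letter suffix, then optionally "-" or "/" and a second number.
HOUSE_NUMBER_RE = re.compile(
    r"^(\d+)([A-Za-z]?)(?:\s*([-/])\s*(\d+)([A-Za-z]?))?\s+(.*)$"
)

def split_house_number(line):
    """Split a line into its house-number parts and the remaining street text."""
    match = HOUSE_NUMBER_RE.match(line)
    if match is None:
        return None  # the line does not start with a house number
    first, first_suffix, separator, second, second_suffix, street = match.groups()
    return {
        "first": first + first_suffix,
        "separator": separator,  # "-", "/" or None
        "second": second + second_suffix if second else None,
        "street": street,
    }

print(split_house_number("161-163 Newhaven Road"))
# {'first': '161', 'separator': '-', 'second': '163', 'street': 'Newhaven Road'}
```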
The code should be able to output several variants (maybe 5 or 6) for every address that is input. The focus should be on the first two lines of the address, because this is where people usually shorten words or simplify the address. The remaining lines, which indicate City/Town/County/Country, may only need to be parsed to provide lowercase possibilities.
The structure and form of the address matter because the code should output "f24", as people may write f24 for Flat 24, possibly with spaces in between. Is this possible in Python/R using regular expressions, and does anyone have a sample they have worked on before?
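As a rough illustration of the regular-expression idea, the sketch below rewrites a leading "Flat 24a" into the shortened forms listed earlier. It assumes the flat designation is always at the start of the line and consists of the word "Flat", a number, and an optional single-letter suffix. `flat_variants` is a hypothetical name.

```python
import re

# "Flat", optional whitespace, digits, optional single-letter suffix (e.g. "Flat 24a").
FLAT_RE = re.compile(r"^flat\s*(\d+)([a-z]?)\b", re.IGNORECASE)

def flat_variants(line):
    """Return the line with its leading flat designation written in several ways."""
    match = FLAT_RE.match(line)
    if match is None:
        return [line]  # no leading "Flat <number>": nothing to vary
    number, suffix = match.group(1), match.group(2)
    rest = line[match.end():]  # remainder of the line, including its leading space
    prefixes = [
        f"Flat {number}{suffix}",
        f"Flat{number}{suffix}",
        f"F{number}{suffix}",
        f"f{number}{suffix}",
    ]
    if suffix:
        prefixes.append(f"f{number} {suffix}")  # e.g. "f24 a"
    return [prefix + rest for prefix in prefixes]

for variant in flat_variants("Flat 24a Ardshiel Avenue"):
    print(variant)
```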
UPDATE: One simpler use case I could think of is a rule-based parser. For example, every UK address is structured so that a comma separates each line of the address. Hence, a rule can be applied to the text up to the first comma; the next rule then processes the next line, up to the following comma, and so on.
Flat 24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
Rule 1 = *Flat 24a Ardshiel Avenue*
The acceptable variant outputs that the parser should provide are:
1) Flat 24a Ardshiel Avenue (The actual line itself)
2) Flat 24a, Ardshiel Avenue (With a comma)
3) Flat24a Ardshiel Avenue
4) F24a Ardshiel Avenue
5) f24a Ardshiel Avenue
6) f24 Ardshiel Avenue
7) Flat24a Ardshiel Ave
8) F24a Ardshiel Ave
9) f24a Ardshiel Ave
10) f24 Ardshiel Ave
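Items 7 to 10 differ from items 3 to 6 only in the abbreviated street type, so that step could be a small lookup applied to each line. This is a sketch under the assumption that the street type is the last word of the line; the abbreviation table and the name `street_type_variants` are hypothetical and would need extending.

```python
import re

# Hypothetical abbreviation table; add further street types as needed.
STREET_ABBREVIATIONS = {"Avenue": "Ave", "Road": "Rd", "Street": "St"}

def street_type_variants(line):
    """Return the line as given, plus a copy with a trailing street type abbreviated."""
    variants = [line]
    for full, short in STREET_ABBREVIATIONS.items():
        # Only replace the street type when it is the last word of the line.
        abbreviated = re.sub(rf"\b{full}$", short, line)
        if abbreviated != line:
            variants.append(abbreviated)
    return variants

print(street_type_variants("F24a Ardshiel Avenue"))
# ['F24a Ardshiel Avenue', 'F24a Ardshiel Ave']
```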
Rule 2 = *Drum Brae*
The acceptable variant outputs that the parser should provide are listed below. Since not many variants can be produced from these two separate words, the acceptable variants could be:
1) Drum Brae (The actual line itself)
2) DrumBrae (Assuming that people can still denote Street names in this way)
Rule 3 = *Edinburgh EH4 7HP*
The acceptable variant outputs that the parser should provide are:
1) Edinburgh EH4 7HP (The actual line itself)
2) Edinburgh EH47HP
At the end, the output pieces for each line should be joined together, separated by commas, to form complete addresses.
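Joining the pieces amounts to taking one variant from each line in every possible combination. A minimal sketch using `itertools.product` from the standard library follows; `combine` is a hypothetical name, and the variant lists are shortened examples taken from the rules above.

```python
from itertools import product

def combine(variants_per_line):
    """Build full addresses by picking one variant per line, in every combination."""
    return [", ".join(parts) for parts in product(*variants_per_line)]

pieces = [
    ["Flat 24a Ardshiel Avenue", "F24a Ardshiel Ave"],  # Rule 1 (shortened list)
    ["Drum Brae", "DrumBrae"],                          # Rule 2
    ["Edinburgh EH4 7HP", "Edinburgh EH47HP"],          # Rule 3
]

addresses = combine(pieces)
print(len(addresses))  # 8 (2 * 2 * 2 combinations)
print(addresses[0])    # Flat 24a Ardshiel Avenue, Drum Brae, Edinburgh EH4 7HP
print(addresses[-1])   # F24a Ardshiel Ave, DrumBrae, Edinburgh EH47HP
```

Note that the number of combinations grows multiplicatively, so ten variants for the first line and two for each of the others already give 40 full addresses.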
I am trying to find out whether there's a library I could leverage, or whether someone could help me with a rule-based parser or regular expressions to solve the above.
UPDATE 2: Would writing many IF-ELSE statements help solve this problem? If so, could anyone share some similar code samples that I could start with?
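One possible alternative to a long IF-ELSE chain is a table of (pattern, replacement) rules that a single loop applies. The sketch below is only an illustration of that design: the rules in `RULES` and the name `apply_rules` are hypothetical, and each rule is applied on its own rather than in combination with the others.

```python
import re

# Hypothetical rule table: each entry is (compiled pattern, replacement).
RULES = [
    (re.compile(r"^Flat\s*(\d+)", re.IGNORECASE), r"F\1"),  # "Flat 24a" -> "F24a"
    (re.compile(r"\bAvenue\b"), "Ave"),
    (re.compile(r"\bRoad\b"), "Rd"),
]

def apply_rules(line):
    """Apply each rule to the original line and collect the distinct results."""
    variants = [line]
    for pattern, replacement in RULES:
        candidate = pattern.sub(replacement, line)
        if candidate not in variants:  # skip rules that changed nothing
            variants.append(candidate)
    return variants

print(apply_rules("Flat 24a Ardshiel Avenue"))
# ['Flat 24a Ardshiel Avenue', 'F24a Ardshiel Avenue', 'Flat 24a Ardshiel Ave']
```

Adding a new variant then means adding one entry to the table instead of another branch.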