My sample content is below:

content = """
Dear Customer,

 Detail of service affected: 

 Bobs Builders
 Retail park 
 The Aavenue
 London
 LDN 4DX


 Start Time & Date: 04/01/2017 00:05 
 Completion Time & Date: 04/01/2017 06:00 

 Details of Work: 
 ....
"""

I'm already pulling out the postcode with:

postcodes = re.findall(r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}", content)

I would also like to get the city from this content. Is that even possible? Would I have to provide a list of cities first and then check against it?

Or is there a way of getting the line before the postcode? The addresses are always sent in that layout.

Could I use the postcode regex to get the word before the postcode?

Thanks

AlexW
  • Why don't you just parse your content line by line? – bruno desthuilliers Dec 12 '16 at 13:14
  • How would I do that? What if the line number is not the same each time? – AlexW Dec 12 '16 at 13:15
  • Is your content always formatted exactly the same way? In that case you can read the word on the 8th line to get the city... or, if the sentence "Detail of service affected:" always comes before the address, you can take the lines that follow it. – Dadep Dec 12 '16 at 13:20
  • @AlexW I assume it's generated content (it looks like an automatic email), so it surely has the same structure whatever the exact content. If you know the address starts on the first non-empty line after the "Detail of service affected: " line, then it's quite easy to parse the address block. Otherwise (if it's not generated content and the structure varies wildly from case to case) I can't really help... – bruno desthuilliers Dec 12 '16 at 13:50
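
The line-by-line approach suggested in the comments could look roughly like the sketch below. It assumes the address block starts on the first non-empty line after the "Detail of service affected:" line, ends at the next blank line, and has the city on the line just before the postcode. The function name `extract_address` is made up for this illustration.

```python
# Sample text mirroring the email in the question (the tail is left out).
content = """
Dear Customer,

 Detail of service affected:

 Bobs Builders
 Retail park
 The Aavenue
 London
 LDN 4DX


 Start Time & Date: 04/01/2017 00:05
"""


def extract_address(text, marker="Detail of service affected:"):
    """Return the non-empty lines after the marker line, up to the next blank line."""
    block = []
    seen_marker = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not seen_marker:
            # Skip everything up to and including the marker line.
            seen_marker = line.startswith(marker)
            continue
        if line:
            block.append(line)
        elif block:
            # A blank line after we have collected something ends the block.
            break
    return block


address = extract_address(content)
# Assumption: the postcode is the last line and the city is the one before it.
city = address[-2] if len(address) >= 2 else None
print(address)
print(city)
```

This does not depend on the line number at all, only on the marker sentence and the blank line after the address.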

2 Answers

Here's an example:

import re
postcodes = re.findall(r"(\w+)\s+([A-Z]{3} \d[A-Z]{2})", content)

print(postcodes)
# => [('London', 'LDN 4DX')]

You get two groups: the first is the word right before the postcode (possibly on another line), and the second is the postcode itself.

The postcode regex has been simplified in order to make the example more readable.

If you want to match any UK code, here is a good reference.

By the way, the regex you mentioned doesn't match LDN 4DX. Adding a ? after [0-9R] fixes that:

postcodes = re.findall(r"[A-Z]{1,2}[0-9R]?[0-9A-Z]? [0-9][A-Z]{2}", content)
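
To see the difference between the two patterns, here is a small comparison sketch. The sample strings are my own picks for illustration, not taken from the question's data. Note that making [0-9R] optional also lets through strings like "AB 1CD" that the stricter pattern rejects, so it is a trade-off.

```python
import re

# Pattern from the question, and the relaxed variant with [0-9R] made optional.
original = r"[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}"
relaxed = r"[A-Z]{1,2}[0-9R]?[0-9A-Z]? [0-9][A-Z]{2}"

# Illustrative inputs: the code from the question, a code with a digit in the
# outward part, and a string the stricter pattern rejects.
samples = ["LDN 4DX", "EC1A 1BB", "AB 1CD"]

results = {}
for sample in samples:
    results[sample] = (
        re.findall(original, sample),
        re.findall(relaxed, sample),
    )
    print(sample, "original:", results[sample][0], "relaxed:", results[sample][1])
```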
Eric Duminil
There are multiple ways to approach this problem:

1- Use Google API geolocation

If you can extract the address part by pattern matching, you can pass the address to the Google Maps Geocoding API and let it parse the address for you.

2- Regex search

If you are sure that the address is always well formatted and the postcode is always preceded by the city name, you can use a regex to handle this:

(\w*)\s+([A-Z]{3}\s+\d[A-Z]{2})
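
A minimal way to apply this pattern in Python might be the following. I've added named groups and used \w+ instead of \w* so the city part cannot be empty; both are my own tweaks, and `city_and_postcode` is a made-up helper name. Be aware it only captures a single word, so a two-word city would come back as its last word.

```python
import re

# Same idea as the pattern above, with named groups; \w+ requires a non-empty word.
CITY_POSTCODE = re.compile(r"(?P<city>\w+)\s+(?P<postcode>[A-Z]{3}\s+\d[A-Z]{2})")


def city_and_postcode(text):
    """Return (city, postcode) for the first match, or None if nothing matches."""
    match = CITY_POSTCODE.search(text)
    if match is None:
        return None
    return match.group("city"), match.group("postcode")


sample = " The Aavenue\n London\n LDN 4DX\n"
print(city_and_postcode(sample))
print(city_and_postcode("no postcode here"))
```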

3- Use a database of city names

If the addresses are not always well formatted, your best bet would be to use a database of city names such as OpenAddresses.
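
As a rough sketch of the lookup idea: the tiny hard-coded set below is only a stand-in for a real city database (how you load one is up to you and not shown here), and `find_cities` is a hypothetical helper. It checks each line of the text against the known names, ignoring case and surrounding whitespace.

```python
# Hypothetical stand-in for a real database of city names.
KNOWN_CITIES = {"london", "manchester", "leeds"}


def find_cities(text, known=KNOWN_CITIES):
    """Return every line of the text that is exactly a known city name."""
    found = []
    for line in text.splitlines():
        candidate = line.strip()
        if candidate.lower() in known:
            found.append(candidate)
    return found


sample = " Bobs Builders\n Retail park\n The Aavenue\n London\n LDN 4DX\n"
print(find_cities(sample))
```

Matching whole lines keeps it simple but misses a city that shares a line with other text; in that case you would need to split each line into words or phrases first.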

4- Use an entity extraction API [BEST]

This is a classic application of information extraction in natural language processing. You can implement your own using nltk, or, even better, use a web service such as AlchemyAPI. Copy and paste your text into their demo to see for yourself how powerful it is.

bman