Set-up
I am using Scrapy to scrape housing ads. Per ad I retrieve a postal code which consists of four numbers followed by 2 letters, e.g. 1053ZM
.
I have a excel sheet linking districts to postal codes in the following way,
district postcode_min postcode_max
A 1011AB 1011BD
A 1011BG 1011CE
A 1011CH 1011CZ
So, the second row states that postcodes ranging from 1011AB, 1011AC,..., 1011AZ, 1011BA,...,1011BD
belong to district A
.
The actual list contains 1214 rows.
Problem
I'd like to match each ad with its respective district, using its postal code and the list.
I am not sure what would be the best way to do this, and how to do this.
I've come up with two different approaches:
- Create all postcodes between
postcode_min
andpostcode_max
, assign all postcodes and their respective districts to a dictionary to subsequently match using a loop.
I.e. create,
d = {'A': ['1011AB','1011AC',...,'1011BD',
'1011BG','1011BH',...,'1011CE',
'1011CH','1011CI',...,'1011CZ'],
'B': [...],
}
and then,
found = False
for distr in d.keys(): # loop over districts
for code in d[distr]: # loop over district's postal codes
if postal_code in code: # assign if ad's postal code in code
district = distr
found = True
break
else:
district = 'unknown'
if found:
break
- Make Python understand there is a range between the
postcode_min
and thepostcode_max
, assign ranges and their respective districts to a dictionary, and match using a loop.
I.e. something like,
d = {'A': [range(1011AB,1011BD), range(1011BG,1011CE),range(1011CH,1011CZ)],
'B': [...]
}
and then,
found = False
for distr in d.keys(): # loop over districts
for range in d[distr]: # loop over district's ranges
if postal_code in range: # assign if ad's postal code in range
district = distr
found = True
break
else:
district = 'unknown'
if found:
break
Issues
For approach 1:
- How do I create all the postal codes and assign them to a dictionary?
For approach 2:
I used range()
for explanatory purpose but I know range()
does not work like this.
- What do I need to effectively have a
range()
as in the example above? - How do I correctly loop over these ranges?
I think my preference lies with approach 2, but I am happy to work with either one. Or with another solution if you have one.