2

Set-up

I am using Scrapy to scrape housing ads. Per ad I retrieve a postal code which consists of four numbers followed by 2 letters, e.g. 1053ZM.

I have a excel sheet linking districts to postal codes in the following way,

district    postcode_min    postcode_max
   A           1011AB           1011BD
   A           1011BG           1011CE
   A           1011CH           1011CZ

So, the second row states that postcodes ranging from 1011AB, 1011AC,..., 1011AZ, 1011BA,...,1011BD belong to district A.

The actual list contains 1214 rows.


Problem

I'd like to match each ad with its respective district, using its postal code and the list.

I am not sure what would be the best way to do this, and how to do this.

I've come up with two different approaches:

  1. Create all postcodes between postcode_min and postcode_max, assign all postcodes and their respective districts to a dictionary to subsequently match using a loop.

I.e. create,

 d = {'A': ['1011AB','1011AC',...,'1011BD',
            '1011BG','1011BH',...,'1011CE',
            '1011CH','1011CI',...,'1011CZ'],
      'B': [...],           
      }

and then,

found = False
for distr in d.keys(): # loop over districts
     for code in d[distr]: # loop over district's postal codes
         if postal_code in code: # assign if ad's postal code in code                 
             district = distr
             found = True
             break
         else:
             district = 'unknown'
     if found:
         break
  1. Make Python understand there is a range between the postcode_min and the postcode_max, assign ranges and their respective districts to a dictionary, and match using a loop.

I.e. something like,

d = {'A': [range(1011AB,1011BD), range(1011BG,1011CE),range(1011CH,1011CZ)],
     'B': [...]
    }

and then,

found = False
for distr in d.keys(): # loop over districts
     for range in d[distr]: # loop over district's ranges
         if postal_code in range: # assign if ad's postal code in range                 
             district = distr
             found = True
             break
         else:
             district = 'unknown'
     if found:
         break

Issues

For approach 1:

  • How do I create all the postal codes and assign them to a dictionary?

For approach 2:

I used range() for explanatory purpose but I know range() does not work like this.

  • What do I need to effectively have a range() as in the example above?
  • How do I correctly loop over these ranges?

I think my preference lies with approach 2, but I am happy to work with either one. Or with another solution if you have one.

LucSpan
  • 1,831
  • 6
  • 31
  • 66

2 Answers2

1

You can use intervaltree to achieve much better lookup speed, and interpret the postal code as a number in base 36 (10 digits and 26 letters).

from intervaltree import IntervalTree
t = IntervalTree()
for district,postcode_min,postcode_max in your_district_table:
    # We read the postcode as a number in base 36
    postcode_min = int(postcode_min, 36)
    postcode_max = int(postcode_max, 36)
    t[postcode_min:postcode_max] = district

If the postcodes are inclusive (include the "max" postcode), then use this instead:

    t[postcode_min:postcode_max+1] = district

Finally, you can look up districts by post_code like this:

def get_district(post_code):
    intervals = t[int(post_code, 36)]
    if not intervals:
        return None
    # I assume you have only one district that matches a postal code
    return intervals[0][2] # The value of the first interval on the list
Liran Funaro
  • 2,750
  • 2
  • 22
  • 33
  • Thanks! I'm trying it out. I've imported the excel sheet using pandas. However, when I try the for-loop, (i.e. run `for district,postcode_min,postcode_max in df`), I get a `ValueError: too many values to unpack (expected 3)`. What to do? – LucSpan May 15 '17 at 10:16
  • I guess that is an all new stackoverflow question. Use `df.iterrows()`. http://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas – Liran Funaro May 15 '17 at 10:19
1

You can just collect the values in excel like this

d = {'A': ['1011AB', '1011BD', '1011BG', '1011CE',  '1011CH', '1011CZ'],
     'B': ['1061WB', '1061WB'],
     }

def is_in_postcode_range(current_postcode, min, max):
    return min <= current_postcode <= max

def get_district_by_post_code(postcode):
    for district, codes in d.items():
        first_code = codes[0]
        last_code = codes[-1]
        if is_in_postcode_range(postcode, first_code, last_code):
            if any(is_in_postcode_range(postcode, codes[i], codes[i+1]) for i in range(0, len(codes), 2)):
                return district
            else:
                return None

usage:

print get_district_by_post_code('1011AC'): A
print get_district_by_post_code('1011BE'): None
print get_district_by_post_code('1061WB'): B
milo
  • 1,747
  • 9
  • 10
  • I assume that the values in excel are ordered – milo May 15 '17 at 10:10
  • It works for some, but not for all postal codes. The postal codes are spread uniquely over the districts, e.g. `1053ZM` lies only within one district. However, e.g. `1053` can be contained in multiple districts, e.g. `1053AA` upto `1053AZ` in `A` while `1053BA` is in `B`. Would this be causing errors? – LucSpan May 15 '17 at 11:53
  • It wouldn't cause error, if The postal codes are spread uniquely over the district. Of couse the input has to be full postal code like 1053AA not 1053 – milo May 15 '17 at 11:56
  • Can you tell me, which postal codes don't work? A example will be good – milo May 15 '17 at 12:02
  • I think the problem is in the way I make the dictionary: when I post the range from the dictionary in your above example, it works. However, in my code it doesn't. – LucSpan May 15 '17 at 12:03
  • You can always sort the list of each district in the dictionary to be sure that the postal codes are sorted, when they are indeed unqiue and not overlapped. It could be that the values in excel are not all sorted. Or maybe you can post the way you make the dictionary. Let's see what's the problem – milo May 15 '17 at 12:16
  • Where do I show you best how I make the dictionary? The comment section isn't the best I think. BTW I see that there are some duplicates in the excel sheet. Duplicates as in a range from e.g. `1053ZM` to `1053ZM`. Would that be problematic? – LucSpan May 15 '17 at 12:18
  • Duplicates values are ok. There are two things, which make my code invalid. 1. the order of postcode_min and postcode_max of one entry in excel are wrong. like: `A 1011BD 1011AB` this is wrong 2. the order of all entries are wrong. like: ` district postcode_min postcode_max A 1011BG 1011CE A 1011CH 1011CZ A 1011AB 1011BD ` – milo May 15 '17 at 12:54
  • Ok. I sorted all the lists which makes things better. Though, still some not-matching where there should be a match. I'll look into it more. There was one ad with a duplicate postal code – `1061WB` (e.g. `1061WB` to `1061WB`) which did not match. Seems like an error in your code? – LucSpan May 15 '17 at 14:38
  • Duplicate codes should not matter, I have add the case with duplicate in my example code in district B, it works at least there – milo May 15 '17 at 14:46
  • Here's the actual dictionary http://chopapp.com/#pwipjzr9. Try `1061WB`, `1071BM` or `1071TW`. – LucSpan May 15 '17 at 15:07