0

Given a list of items like the following (columns separated by tab):

  • 9123456780 \t John Dude \t City of Address \t July 19, 1980 \t M
  • 9123456781 \t Jane Dudette \t Province of Address \t Aug 19, 1980 \t f
  • 9123456782 \t Sam Pol Data \t Etc. City \t 1/1/91 \t
  • 9123456783 \t May Anaise \t Some City \t 1993 \t f
  • 9123456784 \t Mark Mywards \t City of Address \t M
  • 9123456785 \t City of Address \t July 19, 1980 \t M
  • 9123456780 \t M
  • Mira Nova \t City of Address \t July 19, 1980

I am to determine which one is the MSISDN (10 digit number), name, address, date, and gender.

I'm pretty sure this is impossible to do 100% correctly / accurately, due to the lack of comparison points, and often missing data.

So here's what I did:

Ran through the list, line by line. Each line is then split by tab (\t), becoming a list. Each item in the list is then tested in a for loop:

for item in csv_cols:
    if reg_msisdn.match(item):
        s_msisdn = item
    if item.lower() in list_male or item.lower() in list_female:
        s_gender = item
    if parse(item):
        s_birthdate = item
    if any(ext in item.lower() for ext in list_place) or any(ext in item.lower() for ext in list_ad):
        s_address = item
    else:
        s_name = item

    s_all = s_msisdn + "^" + s_name + "^" + s_address + "^" + s_birthdate + "^" + s_gender

EDIT: I added a csv_cols.remove(item) after every s_(value) = item so that tested items will be removed already - it didn't change anything.

  1. All the s_(value) start off with NULL as text
  2. If any item is a 10 digit number (regex), it is considered as the s_msisdn.
  3. If any item is solely an m, f, male, female, it is considered as the s_gender.
  4. If any item has the keywords city, ave, etc. (list_ad) or matches an item in the list of places (list_place), it is considered as the s_address.
  5. If any item can be parsed as a date, it is automatically the s_birthdate.
  6. Else, it is probably the s_name.
  7. EDIT: Remove said item from the list.
  8. The entire thing is in a Try-Exception block.

I'm pretty sure there will be glaring holes in my logic here, but I couldn't really think of any other way to do it.

That said, even with this scatter-brained logic, I've encountered issues, specifically with item no. 5 above, which unhelpfully returns the following error:

signed integer is greater than maximum

I know this because taking it out of the loop makes the rest of the code work.

Can I have some help on this please?

Thanks.

P.S.: I'm using a Mac/UNIX if it means anything.

Community
  • 1
  • 1
zack_falcon
  • 4,186
  • 20
  • 62
  • 108

3 Answers3

1

It's trying to parse your msisdn as a date, which raises an exception rather than returns false, which is why if parse(...): doesn't work. Use something like

try:
    parse(item)
    continue
except ValueError:
    pass

(If it's not ValueError then check what type of exception is being thrown and use that)

Stuart Leigh
  • 826
  • 4
  • 6
  • I forgot to mention that the entire for loop is inside a Try catch block. Added it to the question. Yes, that does make sense - it's trying to parse msisdn as a date. – zack_falcon Aug 17 '14 at 08:23
1

It's going to be almost impossible to distinguish names from addresses in all cases (is "Jack Street" a name or an address?). The best way would be to assume that the name always appears before the address otherwise you faced with doing all sorts of complicated lookups.

I'd deal with the data row by row, first turning the row into a list like this:

>>> row = "9123456780 \t John Dude \t City of Address \t July 19, 1980 \t M"
>>> row = [entry.strip() for entry in row.split("\t")]
>>> row
['9123456780', 'John Dude', 'City of Address', 'July 19, 1980', 'M']

Now I'd define a bunch of functions to try to decide what each entry represented. It should be relatively straightforward to determine the MSISDN number, the gender and the date.

To determine if an entry in a row is the ten-digit MSISDN number:

import re

def is_msisdn(entry):
    if re.match(r"[0-9]{10}", entry):
         return True

To determine if an entry in a row indicates the gender:

def is_gender(entry):
    if entry in ("m", "f", "M", "F"):
         return True

To determine if an entry in a row represents a date:

from dateutil.parser import parse

def is_date(entry):
    try: 
        parse(entry)
        return True
    except ValueError:
        return False

Now use these functions to build another function that parses the entries of the row:

def parse_row(row):

    s_all = ["<blank>"] * 5

    for entry in row:

        if is_msisdn(entry):
            s_all[0] = entry

        elif is_gender(entry):
            s_all[4] = entry

        elif is_date(entry):
            s_all[3] = entry

        elif s_all[1] == "<blank>":
            s_all[1] = entry

        else:
            s_all[2] = entry

    return " ^ ".join(s_all)

This gives for example:

>>> row = ['Mira Nova', 'City of Address', 'July 19, 1980']
>>> parse_row(row)
'<blank> ^ Mira Nova ^ City of Address ^ July 19, 1980 ^ <blank>'

>>> row = ['9123456784', 'Mark Mywards', 'City of Address', 'M']
>>> parse_row(row)
'9123456784 ^ Mark Mywards ^ City of Address ^ <blank> ^ M'
Alex Riley
  • 169,130
  • 45
  • 262
  • 238
  • I'm afraid it's now returning a `'NoneType' object is not iterable` - from the is_date function. – zack_falcon Aug 17 '14 at 13:40
  • What value are you sending into `is_date`? – Alex Riley Aug 17 '14 at 15:39
  • Same as above: item (`for item in row`). Commenting out the is_date(entry) to return only false makes the rest of the code work. – zack_falcon Aug 17 '14 at 15:49
  • I can't seem to replicate the error when trying the code on any of your examples. What is the row you're trying to parse? – Alex Riley Aug 17 '14 at 16:00
  • All of them, via for loop. Strange thing is, it returns row no. 7 with `9123456780^NULL^NULL^NULL^M` (I replaced the with NULL). Everything else returns `'NoneType' object is not iterable`. – zack_falcon Aug 17 '14 at 16:12
0

Try this:

if reg_msisdn.match(item):
        s_msisdn = item
    elif item.lower() in list_male or item.lower() in list_female:
        s_gender = item
    elif parse(item):
        s_birthdate = item
    elif any(ext in item.lower() for ext in list_place) or 
         any(ext in item.lower() for ext in list_ad):
        s_address = item
    else:
        s_name = item

You may need to adjust the order of tests. The idea is that once an item is matched, the other tests are not applied. That will improve the speed, too.

davidc
  • 132
  • 1
  • 4
  • I tried it out that way, but the parse(item) returns `'NoneType' object is not iterable` which is something I understand even less. – zack_falcon Aug 17 '14 at 08:36