Given a list of items like the following (columns separated by tab):
- 9123456780 \t John Dude \t City of Address \t July 19, 1980 \t M
- 9123456781 \t Jane Dudette \t Province of Address \t Aug 19, 1980 \t f
- 9123456782 \t Sam Pol Data \t Etc. City \t 1/1/91 \t
- 9123456783 \t May Anaise \t Some City \t 1993 \t f
- 9123456784 \t Mark Mywards \t City of Address \t M
- 9123456785 \t City of Address \t July 19, 1980 \t M
- 9123456780 \t M
- Mira Nova \t City of Address \t July 19, 1980
I need to determine which column is the MSISDN (a 10-digit number), the name, the address, the date, and the gender.
I'm pretty sure this is impossible to do with 100% accuracy, given the lack of comparison points and the frequently missing data.
So here's what I did:
I run through the list line by line. Each line is split by tab (\t) into a list, and each item in that list is then tested in a for loop:
    for item in csv_cols:
        if reg_msisdn.match(item):
            s_msisdn = item
        if item.lower() in list_male or item.lower() in list_female:
            s_gender = item
        if parse(item):
            s_birthdate = item
        if any(ext in item.lower() for ext in list_place) or any(ext in item.lower() for ext in list_ad):
            s_address = item
        else:
            s_name = item
    s_all = s_msisdn + "^" + s_name + "^" + s_address + "^" + s_birthdate + "^" + s_gender
EDIT: I added a csv_cols.remove(item) after every s_(value) = item so that items already tested are removed. It didn't change anything.
- All the s_(value) variables start off with NULL as text.
- If any item is a 10-digit number (regex), it is considered the s_msisdn.
- If any item is solely m, f, male, or female, it is considered the s_gender.
- If any item has the keywords city, ave, etc. (list_ad) or matches an item in the list of places (list_place), it is considered the s_address.
- If any item can be parsed as a date, it is automatically the s_birthdate.
- Else, it is probably the s_name.
- EDIT: Remove said item from the list.
- The entire thing is in a try/except block.
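As a rough sketch only, the rules above could be written with the standard library as follows. This is not the code from the post: REG_MSISDN, GENDERS, ADDRESS_KEYWORDS, DATE_FORMATS, is_date, and classify are all hypothetical names, the keyword lists are made up, and a fixed set of strptime formats stands in for whatever parse() is. It also assumes each item should land in exactly one column, so it uses an if/elif chain rather than separate if statements, and it tests for an address before testing for a date.

```python
import re
from datetime import datetime

# Hypothetical stand-ins for reg_msisdn, list_male, list_female and list_ad,
# whose actual contents are not shown in the post.
REG_MSISDN = re.compile(r"^\d{10}$")
GENDERS = {"m", "f", "male", "female"}
ADDRESS_KEYWORDS = ("city", "ave", "province")
# A fixed set of formats, standing in for whatever parse() is in the post.
DATE_FORMATS = ("%B %d, %Y", "%b %d, %Y", "%m/%d/%y", "%Y")


def is_date(text):
    """Return True if text matches one of DATE_FORMATS."""
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue  # try the next format
    return False


def classify(line):
    """Split a tab-separated line and assign each field to one column."""
    row = {"msisdn": "NULL", "name": "NULL", "address": "NULL",
           "birthdate": "NULL", "gender": "NULL"}
    for item in (field.strip() for field in line.split("\t")):
        if not item:
            continue  # skip empty columns
        lowered = item.lower()
        if REG_MSISDN.match(item):
            row["msisdn"] = item
        elif lowered in GENDERS:
            row["gender"] = item
        elif any(word in lowered for word in ADDRESS_KEYWORDS):
            row["address"] = item
        elif is_date(item):
            row["birthdate"] = item
        else:
            row["name"] = item  # nothing else matched
    return row


row = classify("9123456784\tMark Mywards\tCity of Address\tM")
print("^".join(row[key] for key in
               ("msisdn", "name", "address", "birthdate", "gender")))
```

Because the keyword test is a plain substring match, a name that happens to contain "ave" or "city" would be filed as an address, so the order of the tests and the choice of keywords both matter.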
I'm sure there are glaring holes in my logic here, but I couldn't think of any other way to do it.
That said, even with this scatter-brained logic I've run into issues, specifically with item no. 5 above, which unhelpfully raises the following error:
signed integer is greater than maximum
I know this because taking it out of the loop makes the rest of the code work.
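One way to check whether the date test is what fails on that line might be to guard each parse call on its own instead of wrapping the whole loop. The sketch below is not the code from the post: safe_parse is a made-up helper name, and naive_year_parse is a hypothetical stand-in parser built on the standard library, chosen only because datetime.date raises OverflowError when handed a 10-digit number as a year.

```python
from datetime import date


def naive_year_parse(text):
    """Hypothetical stand-in parser: treats an all-digit string as a year."""
    return date(int(text), 1, 1)


def safe_parse(text, parser=naive_year_parse):
    """Return the parsed value, or None when the parser rejects the text."""
    try:
        return parser(text)
    except (ValueError, OverflowError):
        # ValueError: not a date at all; OverflowError: number too large.
        return None


for item in ("1993", "9123456784", "Mark Mywards"):
    print(item, "->", safe_parse(item))
```

If parse in the loop is dateutil.parser.parse (an assumption, since the post does not say where it comes from), it could be passed as the parser argument in place of the stand-in.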
Can I have some help on this please?
Thanks.
P.S.: I'm on a Mac (UNIX), if that matters.