I looked around for a while and didn't find anything that matched what I was doing.

I have this code:

import csv
import datetime

legdistrict = []
reader = csv.DictReader(open('active.txt', 'rb'), delimiter='\t')

for row in reader:
    if '27' in row['LegislativeDistrict']:
        legdistrict.append(row)

ages = []

for i,value in enumerate(legdistrict):
    dates = datetime.datetime.now() - datetime.datetime.strptime(value['Birthdate'], '%m/%d/%Y')
    ages.append(int(datetime.timedelta.total_seconds(dates) / 31556952))

total_values = len(ages)
total = sum(ages) / total_values

print total_values
print sum(ages)
print total

which searches a tab-delimited text file and finds the rows in the column named LegislativeDistrict that contain the string 27. (So, finding all rows that are in the 27th LD.) It works well, but I run into issues if the string is a single digit number.

When I run the code with 27, I get this result:

0 ;) eric@crunchbang ~/sbdmn/May 2014 $ python data.py
74741
3613841
48

Which means there are 74,741 values that contain 27, with combined ages of 3,613,841, and an average age of 48.

But when I run the code with 4 I get this result:

0 ;) eric@crunchbang ~/sbdmn/May 2014 $ python data.py
1177818
58234407
49

The first result (1,177,818) is much too large. There are no LDs in my state over 170,000 people, and my lists deal with voters only.

Because of this, I assume that searching for 4 matches every value that contains a 4, so 14, 41, and 24 are all counted, which inflates the total.

Is there a way I can search for a value in a specific column and use a regex or exact search? Regex works, but I can't get it to search just one column -- it searches the entire text file.

My data looks like this:

StateVoterID    CountyVoterID   Title   FName   MName   LName   NameSuffix  Birthdate   Gender  RegStNum    RegStFrac   RegStName   RegStType   RegUnitType RegStPreDirection   RegStPostDirection  RegUnitNum  RegCity RegState    RegZipCode  CountyCode  PrecinctCode    PrecinctPart    LegislativeDistrict CongressionalDistrict   Mail1   Mail2   Mail3   Mail4   MailCity    MailZip MailState   MailCountry Registrationdate    AbsenteeType    LastVoted   StatusCode
IDNUMBER    OTHERIDNUMBER       NAME        MI      01/01/1900  M   123     FIRST   ST      W           CITY    STATE   ZIP MM  123 4   AGE 5                                   01/01/1950  N   01/01/2000  B
Eric Lagergren
  • `'4' in '400'` (for example) will return `True` as `in` does a substring check - any particular reason you're not using `==` to check string equality instead? – Jon Clements Jul 03 '14 at 23:53
  • like `if row === 4`? @JonClements – Eric Lagergren Jul 03 '14 at 23:53
  • `if row['LegislativeDistrict'] == '4'`... `'4' in '400'` is `True`, `'4' == '400'` is `False` – Jon Clements Jul 03 '14 at 23:54
  • Ah, that makes total sense... Don't know why I didn't try that. Python's still very foreign to me. Thank you @JonClements – Eric Lagergren Jul 03 '14 at 23:57
  • You might also wish to watch out for your integer divisions there... You might want to make sure some values are explicitly floats. Try out the results of `3/2` in your interpreter and `float(3)/2` (or just `3.0/2`) - it looks like you've just got `int`s there so you'll be losing precision – Jon Clements Jul 03 '14 at 23:58
  • (you may even just find it easier to do a `from __future__ import division` as the first import line of your script to enable true division by default and you'll be able to see the difference on a re-run) – Jon Clements Jul 04 '14 at 00:02
  • @JonClements I actually am using int() on the timedelta because I'm using ages, and ages are either one thing or another. As far as my data is concerned, if somebody's 45 and 4 months they're still 45, not 45.25 or 46. Thanks for the comment though. I really do appreciate it. If you want, put your comments in an answer so you'll get your points because you *did* answer my question :-) – Eric Lagergren Jul 04 '14 at 00:08
  • @JonClements Although, it did just occur to me that I should probably round up the ages if the average is something like 45.9... does Python have a version of `Math.ceil`? – Eric Lagergren Jul 04 '14 at 00:10
  • I'm off to bed... see [math.ceil](https://docs.python.org/2/library/math.html#math.ceil) :) – Jon Clements Jul 04 '14 at 00:11
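
To illustrate the division and rounding points raised in the comments, here is a minimal sketch. The `ages` list is made-up sample data, not taken from the voter file, and the code forces float division explicitly so it behaves the same on Python 2 and Python 3.

```python
import math

# Hypothetical ages, already truncated to whole years as in the question.
ages = [45, 46, 47, 47]

# Floor division: what `sum(ages) / len(ages)` gives with ints on Python 2.
floored_average = sum(ages) // len(ages)

# True division: convert one operand to float so the fraction is kept.
true_average = sum(ages) / float(len(ages))

# math.ceil rounds up; wrap in int() because Python 2 returns a float here.
rounded_up = int(math.ceil(true_average))

print(floored_average)
print(true_average)
print(rounded_up)
```

Whether to floor, round, or round up the average is a reporting choice; the individual ages can stay truncated either way.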

1 Answer

`'4' in '400'` returns `True` because `in` does a substring check. Use `==` instead: `'4' == '400'` is `False`, since `==` returns `True` only if the two strings are identical:

if '4' == row['LegislativeDistrict']:
    (...)
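
As a fuller sketch of the same idea, the example below filters on one column only, first with exact equality and then with a regex that must match the whole value. It assumes Python 3 (for `io.StringIO` and `re.fullmatch`), and the `SAMPLE` data and the helper names `rows_in_district` and `rows_matching` are hypothetical stand-ins, not part of the original script.

```python
import csv
import io
import re

# Hypothetical tab-delimited sample standing in for active.txt.
SAMPLE = (
    "StateVoterID\tBirthdate\tLegislativeDistrict\n"
    "A1\t01/01/1950\t4\n"
    "A2\t06/15/1980\t14\n"
    "A3\t03/20/1970\t41\n"
    "A4\t12/31/1990\t4\n"
)

def rows_in_district(lines, district):
    """Return rows whose LegislativeDistrict equals `district` exactly."""
    reader = csv.DictReader(lines, delimiter="\t")
    # strip() guards against stray whitespace (an assumption about the data).
    return [row for row in reader
            if row["LegislativeDistrict"].strip() == district]

def rows_matching(lines, column, pattern):
    """Regex variant: test one column only and require a full match."""
    regex = re.compile(pattern)
    reader = csv.DictReader(lines, delimiter="\t")
    return [row for row in reader if regex.fullmatch(row[column].strip())]

exact = rows_in_district(io.StringIO(SAMPLE), "4")
by_regex = rows_matching(io.StringIO(SAMPLE), "LegislativeDistrict", r"4|14")

# The original substring test, for comparison: it also picks up 14 and 41.
substring = [row for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t")
             if "4" in row["LegislativeDistrict"]]

print(len(exact), len(by_regex), len(substring))
```

Because the regex is applied to `row[column]` rather than to the raw line, values in other columns (IDs, ZIP codes, dates) cannot cause a match.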
dwitvliet