
I am a Python beginner trying to count the entries of certain sizes in a big data set. The original data is in a tab-separated text file. I have "Names" (strings, but each row seems like a list) of different animals and their "Sizes" (integers) in a different column. I would like to count all the animals that fall within a certain size range, 10 to 30.

So far, I have successfully counted how many of each "Name" I have, but I am failing to filter by "Size". The code I have is below; I don't get any error, but the size check just seems to be ignored. Could somebody please help me understand why that code is being ignored? Thank you for your help in advance!

import csv, collections

reader=csv.reader(open('C:\Users\Owl\Desktop\Data.txt','rb'), delimiter='\t')
counts=collections.Counter()

for line in reader:
   Name=line[1]
   Size=line[10]
   counts[Name]+=1

for (Name, count) in counts.iteritems():
   if 10<=Size<=30:
      print '%s: %s' % (Name, count)
owl
  • use `r''` modifier for literal strings that are Windows paths: compare `r'c:\tmp'` and `'c:\tmp'`. – jfs Aug 02 '12 at 02:06
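
To illustrate the comment above, here is a minimal sketch comparing the two literals. It uses Python 3 syntax, and `c:\tmp` is simply the example path from the comment:

```python
# Without the r prefix, "\t" inside the literal is read as a TAB character.
plain = 'c:\tmp'
raw = r'c:\tmp'

print(len(plain))  # 5 characters: 'c', ':', TAB, 'm', 'p'
print(len(raw))    # 6 characters: the backslash and the 't' are both kept
print('\t' in plain, '\\' in raw)
```

The same thing can happen with any Windows path that contains a backslash followed by a letter Python treats as an escape, which is why the raw form is the safer habit.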

3 Answers


As written, Size ends up holding only the last size value in the file, because it is never stored along with Name.

On each pass through the for loop, Size is set to line[10], but that value is never saved anywhere else. Name, on the other hand, is indirectly stored as a key in the counter. So the next time the loop runs, Size is overwritten with the next animal's size, and by the time the second loop runs it only holds the size from the final line.
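
A small sketch of that behaviour, using made-up rows instead of the real file. It is written in Python 3 syntax, and the `rows` list is hypothetical:

```python
import collections

# Hypothetical stand-in for the rows of the file: (name, size) pairs.
rows = [("owl", 12), ("whale", 85), ("owl", 25), ("mouse", 3)]

counts = collections.Counter()
for name, size in rows:
    counts[name] += 1  # the name is remembered as a key in the counter
    # size is not saved anywhere; it is overwritten on the next pass

# After the loop, size only holds the value from the last row.
print(size)
print(counts["owl"])
```

Here `size` ends up as the mouse's size, even though the counter still knows about every name.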

Does each animal appear more than once in the data?

You will either need a slightly more complex data structure or to look at the size while looping through the file.

If you don't mind ignoring the animals outside of the size range:

for line in reader:
    size = float(line[10])
    if 10 <= size <= 30:
        name = line[1]
        counts[name] += 1

for name, count in counts.iteritems():
    print '%s: %s' % (name, count)

(Note: I've changed the case and whitespace of your original code to match Python's recommended style guide, PEP 8.)
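
If the sizes do need to be kept for later (the "slightly more complex data structure" option), one possibility is to collect every size under its name and filter afterwards. This is only a sketch in Python 3 syntax; the `rows` data and the names `sizes_by_name` and `in_range` are made up for illustration:

```python
import collections

# Hypothetical (name, size) pairs standing in for the parsed file.
rows = [("owl", 12), ("whale", 85), ("owl", 25), ("owl", 40), ("mouse", 3)]

# Keep every size seen for each name instead of only a running count.
sizes_by_name = collections.defaultdict(list)
for name, size in rows:
    sizes_by_name[name].append(size)

# Count, per name, how many of the stored sizes fall in the 10..30 range.
in_range = {}
for name, sizes in sizes_by_name.items():
    n = sum(1 for s in sizes if 10 <= s <= 30)
    if n:
        in_range[name] = n

print(in_range)
```

This keeps all sizes in memory, so for a very large file the filter-while-reading version above is probably the better fit.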

Lenna
  • I am sorry I am such a beginner. What do you mean by "it's not stored along with Name"? I think that is what I often do, but I am still having a hard time seeing what I am doing wrong... – owl Aug 01 '12 at 20:41
  • Thank you so much for your help! I just noticed that the size was integer data. Because I got an error ValueError: could not convert string to float: Size, I tried size=int(line[10]) but I got ValueError: invalid literal for int() with base 10: 'size'... What does that mean? – owl Aug 01 '12 at 20:45
  • I edited my answer to explain the for loop scope. That error means that line[10] is the string 'size', not an integer. One thought: Python indexes start at 0, so line[10] is really the 11th item. Is it possible you want line[9]? – Lenna Aug 01 '12 at 20:47
  • you could move `name = line[1]` inside the `if` statement as in my code – jfs Aug 01 '12 at 20:50
  • Yes, I thought that might have happened and double checked! But I think I know what happened. The first line is "label" so they are strings. Is there a way to read from the second line using Python? Or should I need to delete that line in the original data? – owl Aug 01 '12 at 20:51
  • @owl: to skip the first line you could call: `next(f)` before passing it to the csv.reader(), where `f` is your file – jfs Aug 01 '12 at 21:00
  • I just tried deleting the first row but still gave me that same error, ValueError: invalid literal for int() with base 10: 'size'... I will now try reader.next() method to see if I do not have to modify the original data! – owl Aug 01 '12 at 21:02
  • @J.F. Sebastian, Thank you so much! I will try that now. – owl Aug 01 '12 at 21:03
  • I know it is really basic question but do I just add next(f) before the "reader=csv.reader(open('C:\Users\Owl\Desktop\Data.txt','rb'), delimiter='\t')" line? I tried and get "NameError: name 'f' is not defined". I often get this error because I am still not good at defining new names... – owl Aug 01 '12 at 21:09
  • `f = open('C:\Users\Owl\Desktop\Data.txt', 'rb')` then `next(f)`. And change it to `csv.reader(f)`. – Lenna Aug 01 '12 at 21:11
  • Thank you so much for being so nice! :) – owl Aug 01 '12 at 21:13
  • This time I got an error "size=int(line[10]) IndexError: list index out of range"... – owl Aug 01 '12 at 21:17
  • Both combining Lenna's and unutbu's solved my problem! Thank you so much for all of your help!! – owl Aug 01 '12 at 21:26
Size=line[10]

makes Size a string.

10<=Size<=30

compares ints with a string (Size).

In [3]: 10 <= '20' <= 30
Out[3]: False

To fix this use:

try:
    Size = float(line[10])
except (ValueError, IndexError):
    continue

The try...except above will cause your program to skip any line in your csv file that either does not have an 11th column or has a string there which cannot be converted to a float.
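
Putting the pieces together, here is a hedged end-to-end sketch in Python 3 syntax. The in-memory `data` string is a made-up stand-in for the real file and has only two columns, so the size is read from `line[1]` rather than `line[10]`:

```python
import collections
import csv
import io

# Hypothetical tab-separated data: a header line, usable rows, one row with
# a non-numeric size, and one row that is too short.
data = "Name\tSize\nowl\t12\nwhale\t85\nowl\tunknown\nowl\t25\nmouse\n"

f = io.StringIO(data)
next(f)  # skip the header line before handing the file to csv.reader
reader = csv.reader(f, delimiter="\t")

counts = collections.Counter()
for line in reader:
    try:
        size = float(line[1])  # the question's file would use line[10]
    except (ValueError, IndexError):
        continue  # skip rows with a missing or non-numeric size
    if 10 <= size <= 30:
        counts[line[0]] += 1

for name, count in counts.items():
    print("%s: %s" % (name, count))
```

With a real file, `f` would come from `open(...)` instead of `io.StringIO`. The try...except would also skip the header on its own, since 'Size' cannot be converted to a float, but calling `next(f)` makes the intent explicit.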


In Python2, ints compare less than strings.

In [4]: 10 <= '1'
Out[4]: True

(Believe it or not, CPython 2 orders objects of mismatched types by type name, with numbers sorted ahead of every other type, so an int always compares less than a string...)

In Python3, a TypeError is raised.

Python 3.2.2 (default, Sep  5 2011, 22:09:30) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 10 <= '1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unorderable types: int() <= str()

Hallelujah.

unutbu
  • Thank you so much for your quick response! I just tried and got an error ValueError: could not convert string to float: Size – owl Aug 01 '12 at 20:38
  • That could happen if `Size` contains letters as well as numbers. Try printing `Size` to see what it equals. – unutbu Aug 01 '12 at 20:43
  • Thank you, I just noticed that the first line of the data was string data because they tell what each column were. Is there any way to skip the first line? – owl Aug 01 '12 at 20:55
  • owl, I've added some code above to show how to handle the first line. – unutbu Aug 01 '12 at 21:08
  • Super! It worked without causing the error finally! Thank you so much for all of your help!! You all are really great! Thank you!!! :) – owl Aug 01 '12 at 21:23
  • @owl, glad I could help. I totally missed the logic error Lenna points out, though; you might want to accept her answer. – unutbu Aug 01 '12 at 21:27

One of the cool features of Python is that dictionary keys can be pretty advanced things such as... tadaaa!... tuples (or dates, or a lot of other stuff, as long as it's hashable, as J.F. Sebastian pointed out; nothing illegal about hashes here). Combine that with regular expressions and you have a pretty fancy "Size classifier" :-) :

sizesFromFile = [
    "Name: Cat, Size: 3.2",
    "Name: Dog, Size: 4.2",
    "Name: BigFoot, Size: 12",
    "Name: Elephant, Size: 31.4",
    "Name: Whale, Size: 85.99",
]

import re
import sys
regex = re.compile(r"^Name:\s*(?P<name>\w+),\s+Size:\s+(?P<size>[\d\.]+)")

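# Each key is an inclusive (low, high) range. Note that a fractional size
# such as 10.5 falls between two ranges and will not be classified.
myRanges = {
    (0, 10): list(),
    (11, 20): list(),
    (21, 30): list(),
    (31, sys.maxint): list()
}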

for line in sizesFromFile:
    match = regex.match(line)
    if match is not None:
        print "Success parsing %s, %s" % (match.groupdict()["name"], match.groupdict()["size"])
        name = match.groupdict()["name"]
        size = float(match.groupdict()["size"])
        for myRange in myRanges:
            if size >= myRange[0] and size <= myRange[1]:
                myRanges[myRange].append(name)

print "This is what I got: %s" % (myRanges)

That outputs:

This is what I got: {(21, 30): [], (11, 20): ['BigFoot'], (0, 10): ['Cat', 'Dog'], (31, 2147483647): ['Elephant', 'Whale']}

Although I'm pretty sure this is far from optimal, speed-wise... but it's still kinda cool, right?
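
For what it's worth, if the linear scan over the ranges ever becomes a bottleneck, one alternative (not what the code above does) is to keep the lower bounds in a sorted list and let the standard `bisect` module find the bin. A rough sketch in Python 3 syntax; the half-open bins and the names `edges`, `bins` and `animals` are assumptions made for this example:

```python
import bisect

# Lower edges of the bins: [0, 10), [10, 20), [20, 30), [30, infinity)
edges = [0, 10, 20, 30]
bins = [[] for _ in edges]

animals = [("Cat", 3.2), ("Dog", 4.2), ("BigFoot", 12),
           ("Elephant", 31.4), ("Whale", 85.99)]

for name, size in animals:
    # bisect_right returns how many edges are <= size; subtracting 1 gives
    # the index of the bin whose lower edge is the largest one <= size.
    index = bisect.bisect_right(edges, size) - 1
    if index >= 0:  # sizes below the first edge are ignored
        bins[index].append(name)

print(bins)
```

Because the bins are half-open, a fractional size such as 10.5 lands in the second bin instead of falling into a gap.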

Savir
  • I was actually just about to comment that while I love regex, it seems like overkill for data this structured. The OP also mentions that the data set is 'big' :) – Lenna Aug 01 '12 at 20:51
  • actually, a list can't be a dict key. Only hashable objects are allowed – jfs Aug 01 '12 at 20:53
  • Thank you for your help! Is there a way to avoid the sizesFromFile part in your code? The issue with my data is that it is very large that it is impossible to write them out all... That is why the data is in a txt file and not in Excel (goes over the row limit). – owl Aug 01 '12 at 20:53
  • @J.F. Sebastian... Dang!! Right! Fixed – Savir Aug 01 '12 at 20:54
  • @owl: Yeah, instead of having the "sizesFromFile" list, just make the line be the line you read from the file (the "line" variable in your example... arg... this is confusing now... You can just read the file line by line and process each line with the regular expression): *for line in f:*, where *f* is your open file object (instead of *for line in sizesFromFile:*) – Savir Aug 01 '12 at 20:59
  • @Lenna... Agreed, but... Still cool!! **:D** It was pretty much to show off a bit – Savir Aug 01 '12 at 21:01