Python average tabular data help

Question

Ok I have the following working program. It opens of a file of data in columns that is too large for excel and finds the average value for each column:

Sample data is:

Joe Sam Bob
1   2   3
2   1   3

And it returns

Joe Sam Bob
1.5 1.5 3

This is good. The problem is some columns have NA as a value. I want to skip this NA and calculate the average of the remaining values So

Bobby
1
NA
2

Should output as

Bobby
1.5

Here is my existing program built with help from here. Any help is appreciated!

with open('C://avy.txt', "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)

    for line in f:
        # Skip empty lines
        if not line.strip():
            continue

        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1

        with open('c://finished.txt', 'w') as ouf:
             for index, summedRowValue in enumerate(sums):
                 print>>ouf, columns[index], 1.0 * summedRowValue / numRows

Now I have this:

with open('C://avy.txt', "rtU") as f:

def get_averages(f):
   headers = f.readline().split()
   ncols = len(headers)
   sumx0 = [0] * ncols
   sumx1 = [0.0] * ncols
   lino = 1

for line in f:
   lino += 1
   values = line.split()

for colindex, x in enumerate(values):
        if colindex >= ncols:
             print >> sys.stderr, "Extra data %r in row %d, column %d" %(x, lino, colindex+1)
             continue
             try:
                value = float(x)
             except ValueError:
               continue
               sumx0[colindex] += 1
        sumx1[colindex] += value
        print headers
print sumx1
print sumx0
averages = [
    total / count if count else None
   for total, count in zip(sumx1, sumx0)
    ]
print averages

and it says:

Traceback (most recent call last): File "C:/avy10.py", line 11, in lino += 1 NameError: name 'lino' is not defined

score 3 · Answer 1 · answered Sep 22 '10 at 21:02

Here is a functional solution:

text = """Joe Sam Bob
1   2   3
2   1   3
NA 2 3
3 5 NA"""

def avg( lst ):
    """ returns the average of a list """
    return 1. * sum(lst)/len(lst)

# split that text
parts = [line.split() for line in text.splitlines()]
#remove the headers
names = parts.pop(0)
# zip(*m) does something like transpose a matrix :-)
columns = zip(*parts)
# convert to numbers and leave out the NA
numbers = [[int(x) for x in column if x != 'NA' ] for column in columns]
# all left is averaging
averages = [avg(col) for col in numbers]
# and printing
for name, x in zip( names, averages):
    print name, x

I wrote a lot of list comprehensions here so you can print out intermediate steps, but those can be generators of cause.

@wok: Elegant solutions don't detect data problems. Elegant solutions with list comprehensions typically waste memory with several copies of the data (whose size is already worrying the OP). — John Machin, Sep 23 '10 at 15:11

Escualo · Answer 2 · 2010-09-22T22:35:35.560

[edited for clarity]

When reading items from a text file, they are imported as strings, not numbers. This means that if your text file has the number 3 and you read it into Python, you would need to convert the string to a number before carrying on arithmetic operations.

Now, you have a text file with colums. Each column has a header and a collection of items. Each item is either a number or not. If it is a number, it will correctly be converted by the function float, if it is not a valid number (this is, if the conversion does not exist) the conversion will raise an exception called ValueError.

So you loop through your list and items as it has been correctly explained in more than one answer. If you can convert to float, accumulate the statistic. If not, go on ignoring that entry.

If you need more info about what is "duck typing" (a paradigm which can be resumed as "better to ask for forgiveness that for permission") please check the Wikipedia link. If you are getting into Python you will hear the term very often.

Below I present a class which can accumulate an statistic (you are interested in the mean). You can use an instance of that class for every column in your table.

class Accumulator(object):
    """
    Used to accumulate the arithmetic mean of a stream of
    numbers. This implementation does not allow to remove items
    already accumulated, but it could easily be modified to do
    so. also, other statistics could be accumulated.
    """
    def __init__(self):
     # upon initialization, the numnber of items currently
     # accumulated (_n) and the total sum of the items acumulated
     # (_sum) are set to zero because nothing has been accumulated
     # yet.
     self._n = 0
     self._sum = 0.0

    def add(self, item):
     # the 'add' is used to add an item to this accumulator
     try:
        # try to convert the item to a float. If you are
        # successful, add the float to the current sum and
        # increase the number of accumulated items
        self._sum += float(item)
        self._n += 1
     except ValueError:
        # if you fail to convert the item to a float, simply
        # ignore the exception (pass on it and do nothing)
        pass

    @property
    def mean(self):
     # the property 'mean' returns the current mean accumulated in
     # the object
     if self._n > 0:
        # if you have more than zero items accumulated, then return
        # their artithmetic average
        return self._sum / self._n
     else:
        # if you have no items accumulated, return None (you could
        # also raise an exception)
        return None

# using the object:

# Create an instance of the object "Accumulator"
my_accumulator = Accumulator()
print my_accumulator.mean
# prints None because there are no items accumulated

# add one (a number)
my_accumulator.add(1)
print my_accumulator.mean
# prints 1.0

# add two (a string - it will be converted to a float)
my_accumulator.add('2')
print my_accumulator.mean
# prints 1.5

# add a 'NA' (will be ignored because it cannot be converted to float)
my_accumulator.add('NA')
print my_accumulator.mean
# prints 1.5 (notice that it ignored the 'NA')

Cheers.

In the past I was hesitant to use this technique, as my primary language is C#, where exceptions are expensive. In Python though this performs swimmingly, and even better clearly conveys the goal of the code. I had the exact same problem once, and this was my solution as well. — Justin R., Sep 22 '10 at 19:03
-1 **WRONG** uses a fixed numRows for each column; should maintain a separate count for each column. — John Machin, Sep 22 '10 at 20:31
@John Machin: if you notice my comment, I specify that the *idea* is to convert to `float` or continue if that is not possible. The relevant part of the loop, of course, would require to do this for every item you want to convert. I never specified that this code was meant to be copy-pasted, I was conveying an idea with code. As you can notice the asker understood this well, sadly you did not. — Escualo, Sep 22 '10 at 21:27
@Arrieta: "You can change the relevant part of your loop" with no qualification??? Only requirement? What about reducing the count of instances? The asker has NOT commented on your answer; how do you divine that he "understood this well"??? — John Machin, Sep 22 '10 at 21:38
I have no idea what Arrieta's answer is all about. If I was that skilled in programming I wouldn't be here. — Robert A. Fettikowski, Sep 22 '10 at 21:47
@John Machin: You are welcome to edit my answer (the whole point of SO). Though judging from your ratio downvote/upvote=1.1 it may be a futile request. I'll make it clearer soon. Thanks for your comments, anyway. — Escualo, Sep 22 '10 at 21:51
@Arrieta: News to me that the whole point of SO is to edit others' crappy answers. Is not providing a better answer a better alternative? My DV/UV ratio is merely a consequence of (1) there being a high proportion of crappy answers on SO (2) I don't upvote answers that an OP could have found by 30 seconds googling. — John Machin, Sep 22 '10 at 22:05
@John. Agree to disagree. We understand collaboration differently. Peace. — Escualo, Sep 22 '10 at 22:07

score -1 · Answer 3 · answered Sep 22 '10 at 16:28

-1

Change your inner-most loop to:

    values = line.split(" ")
    for i in xrange(len(values)):
        if values[i] == "NA":
            continue
        sums[i] += int(values[i])
    numRows += 1

answered Sep 22 '10 at 16:28

RobM

8,373
3
45
37

2

This is plainly wrong - as it does not account for some data sequences getting smaller than others. – jsbueno Sep 22 '10 at 17:12
-1 needs a SEPERATE count for each column; uses int() instead of float() – John Machin Sep 22 '10 at 20:46

inspectorG4dget · Answer 4 · 2010-09-23T00:59:40.583

Much smaller code:

with open('in', "rtU") as f:
    lines = [l for l in f if l.strip()]
    names = '\t'.join(lines[0].split())
    numbers = [[i.strip() for i in line.split()] for line in lines[1:]]
    person_data = zip(*numbers)
    person_data = [tuple(int(i) for i in t if i!="NA") for t in person_data]
    averages = map(lambda x: str(float(sum(x))/len(x)), person_data)

with open('out', 'w') as f:
    f.write(names)
    f.write('\n')
    f.write('\t'.join(averages))

I have tested this after John Machin left his comment. In response to his comments:

This was a bug that existed because I misread the problem. It has been fixed
I have tried to make that line a little more readable now, but to be honest, I don't understand why you called it obfuscatory in the first place
You have pointed out a logic error in my code. I guess I really shouldn't have been doing this in the middle of class... for this I apologize
I agree that readlines() was redundant. I didn't have a proper python interpreter to cross check this, so I left it in as a safety

Hope this is better.

-1 (1) Not loss of plot rather failure to acquire plot; this will die in a hole on the first occurrence of "NA" as a data item (2) the statement beginning "averages =" is excessively obfuscatory (3) `person_data = zip(numbers)` is a nonsense, given `numbers = lines[1:]` (4) `.readlines()` is redundant — John Machin, Sep 22 '10 at 20:43

John Machin · Accepted Answer · 2010-09-22T21:32:22.483

-1

The following code handles varying counts properly, and also detects extra data ... in other words, it's rather robust. It could be improved by explicit messages (1) if the file is empty (2) if the header line is empty. Another possibility is testing explicitly for "NA", and issuing an error message if a field is neither "NA" nor floatable.

>>> import sys, StringIO
>>>
>>> data = """\
... Jim Joe Billy Bob
... 1   2   3     x
... 2   x   x     x  666
...
... 3   4   5     x
... """
>>>
>>> def get_averages(f):
...     headers = f.readline().split()
...     ncols = len(headers)
...     sumx0 = [0] * ncols
...     sumx1 = [0.0] * ncols
...     lino = 1
...     for line in f:
...         lino += 1
...         values = line.split()
...         for colindex, x in enumerate(values):
...             if colindex >= ncols:
...                 print >> sys.stderr, "Extra data %r in row %d, column %d" %
(x, lino, colindex+1)
...                 continue
...             try:
...                 value = float(x)
...             except ValueError:
...                 continue
...             sumx0[colindex] += 1
...             sumx1[colindex] += value
...     print headers
...     print sumx1
...     print sumx0
...     averages = [
...         total / count if count else None
...         for total, count in zip(sumx1, sumx0)
...         ]
...     print averages

Edit add here:

...     return headers, averages

...
>>> sio = StringIO.StringIO(data)
>>> get_averages(sio)
Extra data '666' in row 3, column 5
['Jim', 'Joe', 'Billy', 'Bob']
[6.0, 6.0, 8.0, 0.0]
[3, 2, 2, 0]
[2.0, 3.0, 4.0, None]
>>>

Edit

Normal usage:

with open('myfile.text') as mf:
   hdrs, avgs = get_averages(mf)

edited Sep 22 '10 at 21:32

answered Sep 22 '10 at 20:25

John Machin

81,303
11
141
189

This requires that I manually type in the data right? The data is very lenghty. I don't want to do that. Can't I just open up the file like i do in the old code? – Robert A. Fettikowski Sep 22 '10 at 21:15
No it DOESN'T require that you manually type in the data. Using `StringIO` was only because I wanted to show the function working without requiring a separate input file. I deliberately wrote the answer so that you can use the `get_averages` function on ANY open file-like object. See my edit (which adds a return value to the function). – John Machin Sep 22 '10 at 21:26
Tried it now and it says invalid syntax for this line: – Robert A. Fettikowski Sep 22 '10 at 21:40
2

@Robert: invalid syntax for **WHICH** line? – John Machin Sep 22 '10 at 22:08
print >> sys.stderr, "Extra data %r in row %d, column %d" % (x, lino, colindex+1) – Robert A. Fettikowski Sep 23 '10 at 14:56
Ok now it says: Traceback (most recent call last): File "C:/avy10.py", line 11, in lino += 1 NameError: name 'lino' is not defined – Robert A. Fettikowski Sep 23 '10 at 15:00
@Robert: It's a good idea to mention if you're using Python 3.X ... try `print("Extra data %r in row %d, column %d" % (x, lino, colindex+1), file=sys.stderr)` – John Machin Sep 23 '10 at 15:06
@Robert: (1) Your indentation is stuffed beyond belief. Throw away that code and start again from what I gave you. Don't use tabs anywhere, use spaces. (2) You need to put the function that I gave you up front. Then leave a space. Then put a line with a `with` statement followed by a call of the function -- just like the bit headed "Normal usage" at the end of my answer. – John Machin Sep 23 '10 at 15:21

Python average tabular data help

5 Answers5

Linked