calculate mean using numpy ndarray

Question

The text file looks like this:

david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160
mark weight_2005 90
mark weight_2012 85
mark height_2005 160
mark height_2012 170

How can I calculate the mean weight and mean height for david and mark, as follows?

david>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)
mark>> mean(weight_2005 and weight_2012), mean (height_2005 and height_2012)

My incomplete code is:

 import numpy as np
 import csv
 with open ('data.txt','r') as infile:
   contents = csv.reader(infile, delimiter=' ')
   c1,c2,c3 = zip(*contents)
   data = np.array(c3,dtype=float)

How do I apply np.mean from here?

score 5 · Answer 1 · answered Nov 12 '13 at 16:48

The mean function is for computing the average of an array of numbers. You will need to come up with a way to select the values of c3 by applying a condition to c2.
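
A minimal sketch of that kind of selection, assuming the three columns are already available as tuples (hard-coded here from the sample file instead of being read from data.txt). The names `kinds` and `means` are introduced only for illustration:

```python
import numpy as np

# Columns as they would come out of zip(*contents) in the question's code.
c1 = ("david", "david", "david", "david", "mark", "mark", "mark", "mark")
c2 = ("weight_2005", "weight_2012", "height_2005", "height_2012",
      "weight_2005", "weight_2012", "height_2005", "height_2012")
c3 = ("50", "60", "150", "160", "90", "85", "160", "170")

names = np.array(c1)
# Drop the "_year" suffix so the condition only looks at weight/height.
kinds = np.array([attr.split("_")[0] for attr in c2])
data = np.array(c3, dtype=float)

means = {}
for name in np.unique(names):
    for kind in np.unique(kinds):
        # Boolean mask: rows for this person and this kind of measurement.
        mask = (names == name) & (kinds == kind)
        means[(str(name), str(kind))] = float(np.mean(data[mask]))

for key, value in means.items():
    print(key, value)
```

Each mask picks out the two yearly values for one person and one measurement, so np.mean only ever sees the numbers that belong together.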

What would probably suit your needs better is splitting the data into a hierarchical structure; I prefer using dictionaries. Something like:

import csv

data = {}
with open('data.txt') as f:
    contents = csv.reader(f, delimiter=' ')
    # csv.reader reads lazily, so loop while the file is still open
    for (name, attribute, value) in contents:
        data[name] = data.get(name, {})  # Default value is a new dict
        attr_name, attr_year = attribute.split('_')
        attr_year = int(attr_year)
        data[name][attr_name] = data[name].get(attr_name, {})
        data[name][attr_name][attr_year] = float(value)  # store numbers, not strings

Now data will look like

{
    "david": {
        "weight": {
            2005: 50,
            2012: 60
        },
        "height": {
            2005: 150,
            2012: 160
        }
    },
    "mark": {
        "weight": {
            2005: 90,
            2012: 85
        },
        "height": {
            2005: 160,
            2012: 170
        }
    }
}

Then what you can do is

david_avg_weight = np.mean(list(data['david']['weight'].values()))
mark_avg_height = np.mean([v for k, v in data['mark']['height'].items() if 2008 < k])

Here I'm still using np.mean, but only calling it on a normal Python list.
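
To get every mean at once instead of one lookup at a time, the nested dictionary can be walked with a comprehension. This is only a sketch: the `data` literal below is typed in by hand to mirror the structure shown above (with numeric values), and `means` is a name made up for the example.

```python
import numpy as np

# Same nested layout as shown above, with the values stored as floats.
data = {
    "david": {"weight": {2005: 50.0, 2012: 60.0},
              "height": {2005: 150.0, 2012: 160.0}},
    "mark": {"weight": {2005: 90.0, 2012: 85.0},
             "height": {2005: 160.0, 2012: 170.0}},
}

# means[name][attribute] holds the mean over all years for that attribute.
means = {
    name: {attr: float(np.mean(list(by_year.values())))
           for attr, by_year in attrs.items()}
    for name, attrs in data.items()
}

print(means["david"])
print(means["mark"])
```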

thanks for your efforts, upvoted! but i am looking for more shorter way to do it mainly using numpy @bheklilr — 2964502, Nov 12 '13 at 16:52
@nils NumPy isn't going to make this code any shorter. Even in your example, your code is all parsing the file. Mine just parses the file into a more useful data structure that can then have NumPy functions applied to it. All you want NumPy for is calculating the average, but because you want to be able to do it by conditions, you need to get your data into a form that is more easily manipulated. Pandas might be a good library for doing this for you, but I personally don't see why 9 lines of code is too long. — bheklilr, Nov 12 '13 at 16:56

score 4 · Answer 2 · answered Nov 12 '13 at 17:04

I'll make this community wiki, because it's more "here's how I think you should do it instead" than "here's the answer to the question you asked". For something like this I'd probably use pandas instead of numpy, as its grouping tools are much better. It'll also be useful to compare with numpy-based approaches.

import pandas as pd
df = pd.read_csv("data.txt", sep="[ _]", header=None, 
                 names=["name", "property", "year", "value"])
means = df.groupby(["name", "property"])["value"].mean()

.. and, er, that's it.

First, read in the data into a DataFrame, letting either whitespace or _ separate columns:

>>> import pandas as pd
>>> df = pd.read_csv("data.txt", sep="[ _]", header=None, 
                 names=["name", "property", "year", "value"])
>>> df
    name property  year  value
0  david   weight  2005     50
1  david   weight  2012     60
2  david   height  2005    150
3  david   height  2012    160
4   mark   weight  2005     90
5   mark   weight  2012     85
6   mark   height  2005    160
7   mark   height  2012    170

Then group by name and property, take the value column, and compute the mean:

>>> means = df.groupby(["name", "property"])["value"].mean()
>>> means
name   property
david  height      155.0
       weight       55.0
mark   height      165.0
       weight       87.5
Name: value, dtype: float64

.. okay, the sep="[ _]" trick is a little too cute for real code, though it works well enough here. In practice I'd use a whitespace separator, read in the second column as property_year and then do

df["property"], df["year"] = zip(*df["property_year"].str.split("_"))
del df["property_year"]

to allow underscores in other columns.
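
Put together, that more conservative variant might look roughly like the sketch below. It assumes the same three-column layout as the sample file and uses an in-memory `io.StringIO` stand-in (named `raw` here) in place of data.txt so it can run on its own; `str.split(..., expand=True)` is used instead of the `zip` idiom, which is a substitution on my part.

```python
import io
import pandas as pd

# Inline stand-in for data.txt so the sketch is self-contained.
raw = io.StringIO(
    "david weight_2005 50\n"
    "david weight_2012 60\n"
    "david height_2005 150\n"
    "david height_2012 160\n"
    "mark weight_2005 90\n"
    "mark weight_2012 85\n"
    "mark height_2005 160\n"
    "mark height_2012 170\n"
)

# Plain single-space separator: underscores in other columns are left alone.
df = pd.read_csv(raw, sep=" ", header=None,
                 names=["name", "property_year", "value"])

# Split "weight_2005" into a property column and an integer year column.
parts = df["property_year"].str.split("_", expand=True)
df["property"] = parts[0]
df["year"] = parts[1].astype(int)
df = df.drop(columns="property_year")

means = df.groupby(["name", "property"])["value"].mean()
print(means)
```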

score 2 · Accepted Answer · answered Nov 12 '13 at 16:58

You can read your data directly into a NumPy array with:

data = np.recfromcsv("data.txt", delimiter=" ", names=['name', 'type', 'value'])

then you can find the appropriate indices with np.where:

indices = np.where((data.name == 'david') * data.type.startswith('height'))

and compute the mean over those indices:

np.mean(data.value[indices])

Nicolas Barbey


it would be better if you could explain the meaning of * in your code @Nicolas Barbey – 2964502 Nov 12 '13 at 17:19
There is a TypeError: startswith first arg must be bytes or a tuple of bytes, not numpy.str_. How to correct for it?@Nicolas Barbey – 2964502 Nov 12 '13 at 17:26
* is just multiplication of boolean arrays. – Nicolas Barbey Nov 12 '13 at 17:43
I do not understand the TypeError. I tested on python 2.7.3. What is your version of python ? – Nicolas Barbey Nov 12 '13 at 17:44
i am using python 3.2 and numpy 1.8 @Nicolas Barbey – 2964502 Nov 13 '13 at 02:26
the type error produced in python 3 and numpy 1.8 has been solved by @DSM under http://stackoverflow.com/questions/19944408/getting-indices-in-numpy – 2964502 Nov 13 '13 at 03:21
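
For readers wondering about the `*` in the comments above: on boolean arrays it behaves like an element-wise "and", so it gives the same mask as `&`. The sketch below illustrates this on plain string arrays typed in by hand (not on the recarray from the answer), and uses `np.char.startswith` rather than calling `.startswith` on the array. That is one way I would expect to sidestep the Python 3 TypeError, not necessarily what the linked answer does.

```python
import numpy as np

names = np.array(["david", "david", "david", "david",
                  "mark", "mark", "mark", "mark"])
types = np.array(["weight_2005", "weight_2012", "height_2005", "height_2012",
                  "weight_2005", "weight_2012", "height_2005", "height_2012"])
values = np.array([50, 60, 150, 160, 90, 85, 160, 170], dtype=float)

is_david = names == "david"
# Element-wise startswith that works on NumPy string arrays.
is_height = np.char.startswith(types, "height")

# Multiplying two boolean arrays yields the same mask as a logical "and".
same_mask = np.array_equal(is_david * is_height, is_david & is_height)

indices = np.where(is_david & is_height)
david_mean_height = np.mean(values[indices])
print(same_mask, david_mean_height)
```

Using `&` states the intent a little more directly than `*`, though both select the same rows here.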

score 1 · Answer 4 · answered Nov 12 '13 at 16:54

If your data is always in the format provided, you could do this using array slicing:

(data[:-1:2] + data[1::2]) / 2

Results in:

[  55.   155.    87.5  165. ]
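
An equivalent way to write the same pairing, sketched under the same assumption that every measurement has exactly two consecutive rows (2005 then 2012), is to reshape into pairs and average along each row. The `data` array is typed in from the sample file, and `pair_means` and `sliced` are illustrative names:

```python
import numpy as np

# Third column of the sample file, in file order.
data = np.array([50, 60, 150, 160, 90, 85, 160, 170], dtype=float)

# Each row of the reshaped array is one (2005, 2012) pair.
pair_means = data.reshape(-1, 2).mean(axis=1)

# The slicing expression from this answer, for comparison.
sliced = (data[:-1:2] + data[1::2]) / 2

print(pair_means)  # values: 55.0, 155.0, 87.5, 165.0
```

Changing the 2 in `reshape(-1, 2)` would extend this to more readings per measurement, as long as the rows stay grouped and complete.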

dnf0
