I'll make this community wiki, because it's more "here's how I think you should do it instead" than "here's the answer to the question you asked". For something like this I'd probably use pandas
instead of numpy
, as its grouping tools are much better. It'll also be useful to compare with numpy
-based approaches.
import pandas as pd
df = pd.read_csv("data.txt", sep="[ _]", header=None,
names=["name", "property", "year", "value"])
means = df.groupby(["name", "property"])["value"].mean()
.. and, er, that's it.
First, read in the data into a DataFrame
, letting either whitespace or _
separate columns:
>>> import pandas as pd
>>> df = pd.read_csv("data.txt", sep="[ _]", header=None,
names=["name", "property", "year", "value"])
>>> df
name property year value
0 david weight 2005 50
1 david weight 2012 60
2 david height 2005 150
3 david height 2012 160
4 mark weight 2005 90
5 mark weight 2012 85
6 mark height 2005 160
7 mark height 2012 170
Then group by name
and property
, take the value
column, and compute the mean:
>>> means = df.groupby(["name", "property"])["value"].mean()
>>> means
name property
david height 155.0
weight 55.0
mark height 165.0
weight 87.5
Name: value, dtype: float64
.. okay, the sep="[ _]"
trick is a little too cute for real code, though it works well enough here. In practice I'd use a whitespace separator, read in the second column as property_year
and then do
df["property"], df["year"] = zip(*df["property_year"].str.split("_"))
del df["property_year"]
to allow underscores in other columns.