CSV data - max values for segments of columns using numpy

Question

So Let's say I have a csv file with data like so:

'time'  'speed'
0       2.3
0       3.4
0       4.1
0       2.1
1       1.3
1       3.5
1       5.1
1       1.1
2       2.3
2       2.4
2       4.4
2       3.9

I want to be able to return this file so that for each increasing number under the header 'time', I fine the max number found in the column speed and return that number for speed next to the number for time in an array. The actual csv file I'm using is a lot larger so I'd want to iterate over a big mass of data and not just run it where 'time' is 0, 1, or 2.

So basically I want this to return:

array([[0,41], [1,5.1],[2,4.4]])

Using numpy specifically.

And what have __you__ tried so far? – sshashank124 Mar 30 '14 at 14:56 — sshashank124, Mar 30 '14 at 14:56
Related: http://stackoverflow.com/q/8623047/279627 – Sven Marnach Mar 30 '14 at 15:07 — Sven Marnach, Mar 30 '14 at 15:07

Sven Marnach · Answer 1 · 2014-03-30T15:03:01.317

This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:

a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()
unique_times = numpy.unique(a["time"])
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]

This will load the data into a one-dimenasional array with two fields and sort it first. The result is an array that has its data grouped by time, with the biggest speed value always being the last in each group. We then determine the unique time values that occur and find the rightmost entry in the array for each time value.

behzad.nouri · Answer 2 · 2014-03-30T15:19:57.893

pandas fits nicely for this kind of stuff:

>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time  speed
... 0       2.3
... 0       3.4
... 0       4.1
... 0       2.1
... 1       1.3
... 1       3.5
... 1       5.1
... 1       1.1
... 2       2.3
... 2       2.4
... 2       4.4
... 2       3.9
... """), delim_whitespace=True)
>>> df
    time  speed
0      0    2.3
1      0    3.4
2      0    4.1
3      0    2.1
4      1    1.3
5      1    3.5
6      1    5.1
7      1    1.1
8      2    2.3
9      2    2.4
10     2    4.4
11     2    3.9

[12 rows x 2 columns]

once you have the data-frame, all you need is groupby time and aggregate by maximum of speeds:

>>> df.groupby('time')['speed'].aggregate(max)
time
0       4.1
1       5.1
2       4.4
Name: speed, dtype: float64

Even though `numpy` is the wrong choice for this problem, and `pandas` a much better one, the OP did say "Using numpy specifically".. — DSM, Mar 30 '14 at 15:01

CSV data - max values for segments of columns using numpy

2 Answers2