2

So Let's say I have a csv file with data like so:

'time'  'speed'
0       2.3
0       3.4
0       4.1
0       2.1
1       1.3
1       3.5
1       5.1
1       1.1
2       2.3
2       2.4
2       4.4
2       3.9

I want to be able to return this file so that for each increasing number under the header 'time', I fine the max number found in the column speed and return that number for speed next to the number for time in an array. The actual csv file I'm using is a lot larger so I'd want to iterate over a big mass of data and not just run it where 'time' is 0, 1, or 2.

So basically I want this to return:

array([[0,41], [1,5.1],[2,4.4]]) 

Using numpy specifically.

Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
user3324536
  • 325
  • 1
  • 6
  • 13

2 Answers2

1

This is a bit tricky to get done in a fully vectorised way in NumPy. Here's one option:

a = numpy.genfromtxt("a.csv", names=["time", "speed"], skip_header=1)
a.sort()
unique_times = numpy.unique(a["time"])
indices = a["time"].searchsorted(unique_times, side="right") - 1
result = a[indices]

This will load the data into a one-dimenasional array with two fields and sort it first. The result is an array that has its data grouped by time, with the biggest speed value always being the last in each group. We then determine the unique time values that occur and find the rightmost entry in the array for each time value.

Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
1

pandas fits nicely for this kind of stuff:

>>> from io import StringIO
>>> import pandas as pd
>>> df = pd.read_table(StringIO("""\
... time  speed
... 0       2.3
... 0       3.4
... 0       4.1
... 0       2.1
... 1       1.3
... 1       3.5
... 1       5.1
... 1       1.1
... 2       2.3
... 2       2.4
... 2       4.4
... 2       3.9
... """), delim_whitespace=True)
>>> df
    time  speed
0      0    2.3
1      0    3.4
2      0    4.1
3      0    2.1
4      1    1.3
5      1    3.5
6      1    5.1
7      1    1.1
8      2    2.3
9      2    2.4
10     2    4.4
11     2    3.9

[12 rows x 2 columns]

once you have the data-frame, all you need is groupby time and aggregate by maximum of speeds:

>>> df.groupby('time')['speed'].aggregate(max)
time
0       4.1
1       5.1
2       4.4
Name: speed, dtype: float64
behzad.nouri
  • 74,723
  • 18
  • 126
  • 124
  • 1
    Even though `numpy` is the wrong choice for this problem, and `pandas` a much better one, the OP did say "Using numpy specifically".. – DSM Mar 30 '14 at 15:01