Grouping nearby data in pandas

Question

Let's say I have the following dataframe:

df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})

df
      a   b
0  1.00  10
1  1.10  11
2  1.03  12
3  3.00  13
4  3.10  14

And I want to group nearby points, e.g.

df.groupby(#SOMETHING).mean():

          a     b
a                
0  1.043333  11.0
1  3.050000  13.5

Now, I could use

#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)

But that only works if I know the boundaries beforehand. How can I accomplish similar behavior if I don't know where to place the cuts? That is, I want to group nearby points (with "nearby" defined as within some epsilon).

I know this isn't trivial, because point x might be near point y, and point y might be near point z, but point x might be too far from z, so it's ambiguous what to do. This is kind of a k-means problem, but I'm wondering if pandas has any built-in tools to make this easy.

Use case: I have several processes that generate data at regular intervals, but they're not quite synced up, so the timestamps are close but not identical, and I want to aggregate their data.

`this is kind of a k-means problem` - well more generally a clustering problem. Why not use a clustering algorithm then? — cel, Aug 29 '16 at 20:34
Well, I'm thinking it might just be overkill. If there is an easy way to make use of, e.g., df.a.diff() > 1, it would be much easier. — sheridp, Aug 29 '16 at 20:39
`df.a.diff() > 1 , it would be much easier` - yes but that depends on your data. We cannot guess that for you. You have to look at it and see. But be aware that this solution might not generalize well. — cel, Aug 29 '16 at 20:43
@sheridp, can you post a desired data set, using `epsilon` - because it's not quite clear? — MaxU - stand with Ukraine, Aug 29 '16 at 20:46
@MaxU, using the above dataset, basically looking for gaps greater than epsilon. epsilon could be 1 in this case. — sheridp, Aug 29 '16 at 20:47

score 1 · Answer 1 · edited May 23 '17 at 10:33

Based on this answer

df.groupby( (df.a.diff() > 1).cumsum() ).mean()
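The one-liner above depends on `a` already being roughly in order, with the threshold hard-coded as 1. A minimal sketch of the same idea follows, assuming you sort first and pass epsilon as a parameter. The helper name `group_nearby` is hypothetical and not part of pandas.

```python
import pandas as pd


def group_nearby(df, col, eps):
    """Return integer group labels for df (hypothetical helper, not a pandas API).

    A new group starts wherever the gap between consecutive sorted
    values of `col` exceeds `eps`.
    """
    ordered = df.sort_values(col)
    # diff() is NaN for the first row; NaN > eps is False, so it starts group 0.
    labels = (ordered[col].diff() > eps).cumsum()
    # Put the labels back in the original row order.
    return labels.reindex(df.index)


df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})
labels = group_nearby(df, 'a', eps=1)
result = df.groupby(labels).mean()
print(result)
```

Note that this chains points together: if x is within eps of y and y is within eps of z, all three land in the same group even when x and z are more than eps apart. For the timestamp use case, the same comparison should work with a datetime column and `eps` given as a `pd.Timedelta`.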

answered Aug 29 '16 at 20:46 by sheridp
