
Let's say I have the following dataframe:

df = pd.DataFrame({'a':[1,1.1,1.03,3,3.1], 'b':[10,11,12,13,14]})

df
      a   b
0  1.00  10
1  1.10  11
2  1.03  12
3  3.00  13
4  3.10  14

And I want to group nearby points, e.g.

df.groupby(#SOMETHING).mean():

          a     b
a                
0  1.043333  11.0
1  3.050000  13.5

Now, I could use

#SOMETHING = pd.cut(df.a, np.arange(0, 5, 2), labels=False)

But that only works if I know the boundaries beforehand. How can I get similar behavior if I don't know where to place the cuts? That is, I want to group nearby points (with nearby defined as within some epsilon).

I know this isn't trivial, because point x might be near point y, and point y might be near point z, but point x might be too far from z, so it's ambiguous what to do. This is kind of a k-means problem, but I'm wondering if pandas has any built-in tools to make this easy.

Use case: I have several processes that generate data at regular intervals, but they're not quite synced up, so the timestamps are close but not identical, and I want to aggregate their data.

sheridp
  • `this is kind of a k-means problem` - well more generally a clustering problem. Why not use a clustering algorithm then? – cel Aug 29 '16 at 20:34
  • Well, I'm thinking it might just be overkill. If there is an easy way to make use of, e.g., `df.a.diff() > 1`, it would be much easier. – sheridp Aug 29 '16 at 20:39
  • `df.a.diff() > 1 , it would be much easier` - yes but that depends on your data. We cannot guess that for you. You have to look at it and see. But be aware that this solution might not generalize well. – cel Aug 29 '16 at 20:43
  • @sheridp, can you post a desired data set, using `epsilon` - because it's not quite clear? – MaxU - stand with Ukraine Aug 29 '16 at 20:46
  • @MaxU, using the above dataset, basically looking for gaps greater than epsilon. epsilon could be 1 in this case. – sheridp Aug 29 '16 at 20:47
  • @sheridp, well, you've already found an answer... ;) – MaxU - stand with Ukraine Aug 29 '16 at 20:55

1 Answer


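Based on this answer (with epsilon = 1, and assuming the rows are already ordered so that gaps in `a` mark the group boundaries):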

df.groupby( (df.a.diff() > 1).cumsum() ).mean()
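
For anyone who wants to reuse this, here is a minimal sketch of the same idea with the threshold pulled out into a variable. The names `eps`, `df_sorted` and `group_id` are just illustrative, and the explicit sort is my own assumption: `diff()` only compares each row with the previous one, so the key column needs to be ordered for the gaps to mean anything.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1.1, 1.03, 3, 3.1], 'b': [10, 11, 12, 13, 14]})

eps = 1  # illustrative threshold: a gap larger than this starts a new group

# diff() compares each row with the one before it, so order by the key first
df_sorted = df.sort_values('a')

# True wherever the gap to the previous value exceeds eps (the first row's
# diff is NaN, which compares as False); cumsum() turns the flags into ids
group_id = (df_sorted['a'].diff() > eps).cumsum()

result = df_sorted.groupby(group_id).mean()
print(result)
```

Note that this only splits on gaps between consecutive values, so a long chain of points that are each within `eps` of their neighbour ends up in one group, even if the first and last are far apart. If that matters for your data, a proper clustering algorithm (as suggested in the comments) is probably the better tool.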
sheridp