0

I am using scipy.binned_statistic to get the frequency of points within a bin, such that:

h, xedge, yedge, binindex = scipy.stats.binned_statistic_2d(X, Y, Y, statistic='mean', bins=160)

I am able to filter out certain bins using the following:

filter = list(np.argwhere(h > 5).flatten())

From this I can get the bin edges/center from edges and yedges for the data I am interested in.

What is the most pythonic way to get the original data from these bins of interest? For example, how do I get the original data that is contained within the bins which have more than 5 points?

GeoMonkey
  • 1,615
  • 7
  • 28
  • 56

1 Answers1

1

Yes, that's possible with some indexing magic. I am not sure if it is the most Pythonic way, but it should be close.

The solution for 1d using stats.binned_statistic:

from scipy import stats
import numpy as np

values = np.array([1.0, 1.0, 2.0, 1.5, 3.0]) # not used with 'count'
x = np.array([1, 1, 1, 4, 7, 7, 7])

statistic, bin_edges, binnumber = stats.binned_statistic(x, values, 'count', bins=3)
print(statistic)
print(bin_edges)
print(binnumber)

# find the bins with equal or more than three events
# if you are using custom bins where events can be lower or
# higher than your specified bins -> handle this

# get the bin numbers according to some condition
idx_bin = np.where(statistic >= 3)[0]
print(idx_bin)

# A binnumber of i means the corresponding value is 
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one 

idx_bin += 1
print(idx_bin)

# the rest is easy, get the boolean mask and apply it
is_event = np.in1d(binnumber, idx_bin)
events = x[is_event]
print(events)

For 2d or nd you could use the solution above multiple times and combine the is_event masks for each dimension using np.logical_and (2d) or np.logical_and.reduce((x, y, z)) (nd, see here).

The solution for 2d using stats.binned_statistic_2d is basically the same:

from scipy import stats
import numpy as np

x = np.array([1, 1.5, 2.0, 4, 5.5, 1.5, 7, 1])
y = np.array([1.0, 7.0, 1.0, 3, 7, 7, 7, 1])
values = np.ones_like(x) # not used with 'count'

# check keyword expand_binnumbers, use non-linearized
# as they can be used as indices without flattening
ret = stats.binned_statistic_2d(x, 
                                y,
                                values,
                                'count',
                                bins=2,
                                expand_binnumbers=True)

print(ret.statistic)
print('binnumber', ret.binnumber)

binnumber = ret.binnumber
statistic = ret.statistic
# find the bins with equal or more than three events
# if you are using custom bins where events can be lower or
# higher than your specified bins -> handle this

# get the bin numbers according to some condition
idx_bin_x, idx_bin_y = np.where(statistic >= 3)#[0]
print(idx_bin_x)
print(idx_bin_y)

# A binnumber of i means the corresponding value is 
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one 
idx_bin_x += 1
idx_bin_y += 1

print(idx_bin_x)
print(idx_bin_y)

# the rest is easy, get the boolean mask and apply it
is_event_x = np.in1d(binnumber[0], idx_bin_x)
is_event_y = np.in1d(binnumber[1], idx_bin_y)
is_event_xy = np.logical_and(is_event_x, is_event_y)

events_x = x[is_event_xy]
events_y = y[is_event_xy]

print('x', events_x)
print('y', events_y)
Joe
  • 6,758
  • 2
  • 26
  • 47
  • I'm not sure if this makes a difference but I was using a 2d histogram. I edited the questions slightly because I realized I was pasted the wrong bit of code. I am using scipy.binned_statistics2d not matplotlib.hist2d() – GeoMonkey Jun 03 '20 at 20:43
  • Not sure if it works out of the box, but the idea in general will work. I will look at it later. – Joe Jun 03 '20 at 21:34