How do you calculate the mean values for bins with a 2D histogram in Python? I have temperature ranges for the x and y axes and I am trying to plot the probability of lightning using bins for the respective temperatures. I am reading in the data from a CSV file and my code is as follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

filename = 'Random_Events_All_Sorted_85GHz.csv'
df = pd.read_csv(filename)

min37 = df.min37
min85 = df.min85
verification = df.five_min_1

#Numbers
x = min85
y = min37
H = verification

#Estimate the 2D histogram
nbins = 4
H, xedges, yedges = np.histogram2d(x,y,bins=nbins)

#Rotate and flip H
H = np.rot90(H) 
H = np.flipud(H)

#Mask zeros
Hmasked = np.ma.masked_where(H==0,H)

#Plot 2D histogram using pcolor
fig1 = plt.figure()
plt.pcolormesh(xedges,yedges,Hmasked)
plt.xlabel('min 85 GHz PCT (K)')
plt.ylabel('min 37 GHz PCT (K)')
cbar = plt.colorbar()
cbar.ax.set_ylabel('Probability of Lightning (%)')

plt.show()

This makes a nice looking plot, but the data that is plotted is the count, or number of samples that fall into each bin. The verification variable is an array that contains 1's and 0's, where a 1 indicates lightning and a 0 indicates no lightning. I want the data in the plot to be the probability of lightning for a given bin based on the data from the verification variable - thus I need bin_mean*100 in order to get this percentage.

I tried using an approach similar to what is shown here (binning data in python with scipy/numpy), but I was having difficulty getting it to work for a 2D histogram.

mbreezy
  • Why is this question marked as a duplicate? This is related to 2d histograms. As well, the selected answer should be changed. @Alleo's answer is *much* simpler. – akozi Oct 17 '18 at 14:33
  • because histograms are a specific case of binning (binning = group points by values of independent variables and compute a statistics of choice on the values of dependent variable for each group; histogram = binning where the statistic of choice is "count"). Indeed the linked answers work also in this case. I agree Alleo's answer is good. I am not sure if histogram2d was not always there or if I had missed it in the past. – Vincenzooo Jul 28 '21 at 10:35

2 Answers


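There is an elegant and fast way to do this! Use the weights parameter of np.histogram2d to sum the verification values in each bin:

denominator, xedges, yedges = np.histogram2d(x, y, bins=nbins)
numerator, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=verification)

Then all you need is to divide the sum of values in each bin by the number of events in that bin. Clipping the denominator to a minimum of 1 avoids division by zero, so empty bins come out as 0. Multiply the result by 100 to get a percentage:

result = numerator / denominator.clip(1)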

Voila!
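For readers who want to try this end to end, here is a minimal sketch of the weights approach. It assumes synthetic random data in place of the CSV columns (the value ranges and the names `counts`, `hits` and `percent` are made up for illustration), and it masks empty bins as NaN instead of clipping, which is a swapped-in choice rather than what the answer above does.

```python
import numpy as np

# Synthetic stand-ins for the CSV columns (assumed data, not the asker's file)
rng = np.random.default_rng(0)
x = rng.uniform(150.0, 300.0, size=1000)      # plays the role of min85
y = rng.uniform(200.0, 300.0, size=1000)      # plays the role of min37
verification = rng.integers(0, 2, size=1000)  # 1 = lightning, 0 = no lightning

nbins = 4

# Number of samples per bin
counts, xedges, yedges = np.histogram2d(x, y, bins=nbins)

# Sum of the verification values per bin, reusing the same edges
hits, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=verification)

# Percentage of lightning per bin; empty bins give 0/0 = NaN and are masked
with np.errstate(divide="ignore", invalid="ignore"):
    percent = 100.0 * hits / counts
percent = np.ma.masked_invalid(percent)

# histogram2d indexes its result as [x, y], while pcolormesh expects [y, x],
# so the array would be transposed for plotting, for example:
#   plt.pcolormesh(xedges, yedges, percent.T)
```

Masking keeps empty bins transparent in the plot, whereas `clip(1)` reports them as 0, which could be mistaken for a bin with samples but no lightning.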

Alleo

This is doable with at least the following method:

# xedges, yedges as returned by 'histogram2d'

# create an array for the output quantities
avgarr = np.zeros((nbins, nbins))

# determine the X and Y bins each sample coordinate belongs to
xbins = np.digitize(x, xedges[1:-1])
ybins = np.digitize(y, yedges[1:-1])

# calculate the bin sums (note: if you have very many samples, it is more
# efficient to use 'bincount', but that requires some index arithmetic)
for xb, yb, v in zip(xbins, ybins, verification):
    avgarr[yb, xb] += v

# replace 0s in H by NaNs (avoids divide-by-zero complaints)
# (H here is the rotated and flipped count array from the question, indexed [y, x])
# if you have no further use for H after plotting, the
# copy operation is unnecessary; this will then also take care
# of the masking (NaNs are plotted transparent)
divisor = H.copy()
divisor[divisor==0.0] = np.nan

# calculate the average
avgarr /= divisor

# now 'avgarr' contains the averages (NaNs for no-sample bins)

If you know the bin edges beforehand, you can do the histogram part in the same loop by adding one line.
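The 'bincount' variant mentioned in the code comments could look roughly like the sketch below. It assumes synthetic random data, and the names `flat`, `sums` and `counts` are illustrative. Only the interior edges (`xedges[1:-1]`) are passed to `np.digitize` so that the returned indices run from 0 to nbins-1 and a sample equal to the top edge lands in the last bin, which matches how `histogram2d` assigns samples.

```python
import numpy as np

# Synthetic stand-ins for the CSV columns (assumed data)
rng = np.random.default_rng(1)
x = rng.uniform(150.0, 300.0, size=500)
y = rng.uniform(200.0, 300.0, size=500)
verification = rng.integers(0, 2, size=500)  # 1 = lightning, 0 = no lightning

nbins = 4
H, xedges, yedges = np.histogram2d(x, y, bins=nbins)

# Bin index of every sample along each axis, using interior edges only
xbins = np.digitize(x, xedges[1:-1])
ybins = np.digitize(y, yedges[1:-1])

# Index arithmetic: collapse each (row = y bin, column = x bin) pair
# into a single flat index so that bincount can do the accumulation
flat = ybins * nbins + xbins

sums = np.bincount(flat, weights=verification, minlength=nbins * nbins)
counts = np.bincount(flat, minlength=nbins * nbins)
sums = sums.reshape(nbins, nbins)
counts = counts.reshape(nbins, nbins)

# Per-bin average, indexed [y, x]; bins without samples become NaN
with np.errstate(invalid="ignore"):
    avgarr = sums / counts
```

Here `counts` should equal the transposed histogram `H.T`, so a separate `histogram2d` call is only needed to obtain the edges.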

DrV
  • `xbins = np.digitize(x, xedges[1:-1]) ybins = np.digitize(y, yedges[1:-1])` why do you index xedges and y edges? – Rish Dec 15 '20 at 20:50