0

I need help figuring out why there is a discrepancy in the histogram A and B generated in the code below. I'm a physicist and some colleagues and me noted this as we were plotting the same data in python, IDL and Matlab. Python and IDL have the same problem, however Matlab does not. Matlab always reproduce histogram B.

import numpy as np
import matplotlib.pyplot as plt

t = np.random.randint(-1000,1000,10**3)

# A
tA = t/1000
binsizeA = 0.05
xminA = -1
xmaxA = 1
binsA = np.arange(xminA, xmaxA+binsizeA, binsizeA)
hA, _ , _ = plt.hist(tA, bins=binsA, histtype="step", label="A")

# B
tB = t
binsizeB = 50
xminB = -1000
xmaxB = 1000
binsB = np.arange(xminB, xmaxB+binsizeB, binsizeB)
hB, _ , _ = plt.hist(tB/1000, bins=binsB/1000, histtype="step", label="B")


plt.legend()
plt.show()

print(hA==hB)

Plot showing the histograms

The original data are time tagged measurements with microsecond presision saved as integers. The problems seems to be when the array are divided by 1000 (from microsecond to millisecond). Is there a way to avoid this?

Anders
  • 1

1 Answers1

0

I start by "recreating" scenario A, but directly by scaling everything (data + bins) from B:

C - binsB / 1000

# C
tC = tB / 1000
xminC = xminB / 1000
xmaxC = xmaxB / 1000
binsC = binsB / 1000
hC, _ , _ = plt.hist(tC, bins=binsC, histtype="step", label="C")

assert((hB == hC).all())

This produces the same histogram as hB, so the problem is in the way binsA is made:

binsA = np.arange(xminA, xmaxA+binsizeA, binsizeA)

From its docstring:

When using a non-integer step, such as 0.1, the results will often not be consistent. It is better to use numpy.linspace for these cases.

So either go route C or use linspace to create the non-integer bins with less rounding errors.

D - np.linspace

Interestingly, using linspace does not yield floating-point equal bins as binsB / 1000 does:

# D
tD = t / 1000
bincountD = 41
xminD = -1
xmaxD = 1
binsD = np.linspace(xminD, xmaxD, 41)

hC, _ , _ = plt.hist(tC, bins=binsC, histtype="step", label="C")
hD, _ , _ = plt.hist(tD, bins=binsD, histtype="step", label="D")

plt.legend()
plt.show()

By inspection, both binsC look equal to binsD, but still differ in their least signifcant digits. I can "clamp" them to yield the same histogram by binsX.round(2).

But in total, this serves as a reminder how tricky it is to achieve "exact" results. But note that this fact is amplified here, as all your samples were integers to begin with. If your data is floating point as well, bins and samples would not be value-identical.

ojdo
  • 8,280
  • 5
  • 37
  • 60
  • Thank you for a clear answer. I learned something new today. np.arange() should be used with care, and np.linspace() is a much more precise function. – Anders Apr 13 '21 at 12:39