
I have a large array of data and need its histogram without the bin that has the highest frequency. I use this to remove that bin, but then I need to save the modified histogram, because I have to compare it with another histogram. I don't know how to do this: the original data is unchanged, so the change shows up only in the plot. I thought about modifying the original data to reflect the change (for example, removing the data points that fall in the highest-frequency bin), but nothing I have tried so far works. Here is sample code, based mainly on the link above with a few changes for my purpose, which unfortunately doesn't do the job:

import numpy as np
import matplotlib.pyplot as plt

gaussian_numbers = np.random.randn(100)

# Get histogram
values, bin_edges = np.histogram(gaussian_numbers, bins=6)
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
width = (bin_edges[1] - bin_edges[0])
plt.bar(centers, values, color="blue",align='center',width=width)
plt.show()

values[np.where(values == np.max(values))] = 0
binCenters =(bin_edges[:-1] + bin_edges[1:]) / 2

plt.bar(binCenters, values, color="blue",align='center', width=width)  
plt.show()

new=gaussian_numbers[(gaussian_numbers!= np.max(values))]
print(np.sum(new - gaussian_numbers))

I can see that the bin with the highest frequency has been removed when I draw the bar graph. But when I try to remove those values from my data and save the result in an array called new (whose histogram I then want to save), there is no difference between new and gaussian_numbers, so their histograms are the same as well. Is there any way to remove this data?

Miranda

1 Answer

I think I figured out how to do it. Basically, I find the range of the bin with the highest frequency in the histogram and then remove the data in that range from the original data. Here is the sample code for those who are interested:

import numpy as np
import matplotlib.pyplot as plt

gaussian_numbers = np.random.randn(100)
print(gaussian_numbers.shape)

# Get histogram
values, bin_edges = np.histogram(gaussian_numbers, bins=6)
centers = (bin_edges[:-1] + bin_edges[1:]) / 2
width = bin_edges[1] - bin_edges[0]
plt.bar(centers, values, color="blue", align="center", width=width)
plt.show()

# Edges of the bin with the highest count (np.argmax returns the first one on ties)
top = np.argmax(values)
bin_min, bin_max = bin_edges[top], bin_edges[top + 1]

# Bins are half-open [bin_min, bin_max); only the last bin also includes its right edge
if top == len(values) - 1:
    above = gaussian_numbers > bin_max
else:
    above = gaussian_numbers >= bin_max
new_val = gaussian_numbers[(gaussian_numbers < bin_min) | above]

# Reuse the original edges so both histograms have the same bins
values, bin_edges = np.histogram(new_val, bins=bin_edges)
plt.bar(centers, values, color="blue", align="center", width=width)
plt.show()

These are the before and after bar graphs:

[Image: bar graph of the histogram before removing the highest-frequency bin]

[Image: bar graph of the histogram after removing the highest-frequency bin]
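
The same removal can also be written in terms of bin indices instead of edge comparisons, which makes it easy to handle several bins tied for the highest count. This is only a minimal sketch using NumPy alone; the helper name `drop_fullest_bins` and the seeded sample data are hypothetical and not part of the code above:

```python
import numpy as np

def drop_fullest_bins(data, bins=6):
    """Return (kept_data, bin_edges) with every sample from the fullest bin(s) removed."""
    counts, edges = np.histogram(data, bins=bins)
    # Bin index of each sample, following np.histogram's convention:
    # bins are half-open [left, right), except the last, which includes its right edge.
    idx = np.searchsorted(edges, data, side="right") - 1
    idx = np.clip(idx, 0, len(counts) - 1)  # puts data.max() into the last bin
    fullest = np.flatnonzero(counts == counts.max())  # several indices on ties
    keep = ~np.isin(idx, fullest)
    return data[keep], edges

rng = np.random.default_rng(0)
data = rng.standard_normal(100)
kept, edges = drop_fullest_bins(data)

# Histogram both arrays with the same edges so the bins can be compared directly
old_counts, _ = np.histogram(data, bins=edges)
new_counts, _ = np.histogram(kept, bins=edges)
```

With shared edges, `new_counts` should be zero in the removed bin(s) and match `old_counts` everywhere else.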

Notice that I can now save the new histogram, since I have the new data saved after removing the highest-frequency bin of the initial histogram. Also note that the initial and final histograms must use the same bins in order to see which bin the data was removed from.
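
One way to make the "same bins" requirement explicit when saving and comparing is to store the counts together with their edges and reuse those edges for the other histogram. The following is a rough sketch, not a definitive recipe: the stand-in arrays, the in-memory buffer used in place of a file, and the sum of absolute differences used for the comparison are all illustrative assumptions.

```python
import io
import numpy as np

rng = np.random.default_rng(1)
new_val = rng.standard_normal(80)  # stand-in for the filtered data
other = rng.standard_normal(80)    # stand-in for the data to compare against

counts, edges = np.histogram(new_val, bins=6)

# Save counts and edges together; a filename such as "hist.npz" works the same way
buf = io.BytesIO()
np.savez(buf, counts=counts, edges=edges)
buf.seek(0)

saved = np.load(buf)
# Bin the second data set with the saved edges so the bins line up one to one.
# Samples of `other` that fall outside the saved edges are not counted.
other_counts, _ = np.histogram(other, bins=saved["edges"])

# One simple comparison (an assumption, not the only choice):
# the sum of absolute differences between the two sets of counts
l1_distance = np.abs(saved["counts"] - other_counts).sum()
```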

Miranda