Why do histograms look different with different breaks arguments in R?

Question

I want to plot the distribution of a dataset as a histogram in R. I tried different breaks arguments (the default, Freedman-Diaconis, and Scott) to get the best representation. I may use a log scale later, but first I want to see the raw distribution without any scaling. However, the results look different. Why is that? The dataset I use can be downloaded from here data or here data. The code I'm running is

hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = 200)

result is

hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = "Scott")

Result is

hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks="Freedman-Diaconis")

result is

Please help. Thank you very much.

The cells are much narrower in the last plot. You can also check the underlying functions `nclass.scott()` and `nclass.FD()`. – Roman Jul 04 '22 at 11:57
@Roman I have checked them: the numbers of classes are 7518, 114767, and 28 for Scott, FD, and Sturges, respectively. However, I still do not understand why the former two only show the distribution of non-negative values. – MK Huda Jul 04 '22 at 20:05
Are you sure? Have you checked the breaks as in the answer below? Also try zooming in with suitable x-axis limits. – Roman Jul 05 '22 at 07:58
@Roman Yes, I am sure. I have already zoomed in, and it only starts from 0 and shows positive values. – MK Huda Jul 05 '22 at 08:24
Your dataset requires a user to request access on Google Drive. Please publish it somewhere where it can be accessed immediately. – Caspar V. Jul 05 '22 at 23:33
@CasparV. I have updated it. Hopefully you can access it now. – MK Huda Jul 06 '22 at 12:23
From which column in the data do you want to plot the histogram? There is no column "deviation_all_genes_all_spots". – cdalitz Jul 06 '22 at 12:48
@cdalitz I want to plot the distribution of all values across all columns and rows. – MK Huda Jul 06 '22 at 13:02
The data you provide cannot be the data that you show in your plots. When I load the data into a variable *x* with `read.csv(..., header=T)` and convert the non-factor columns into a numeric vector with `y <- as.numeric(as.matrix(x[,-1]))`, then `range(y)` returns [-0.04622332, 0.08013825], and not something that goes into the ten thousands. – cdalitz Jul 06 '22 at 13:24
@MKHuda: Show us how you created `deviation_all_genes_all_spots` from the CSV. I see you're not new here, but please take the time to read [how to ask a good question](https://stackoverflow.com/help/how-to-ask), and check out the answers to [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – Caspar V. Jul 06 '22 at 22:46

score 1 · Answer 1 · answered Jul 04 '22 at 12:02

Histograms are very sensitive to the choice of cell break points. Even for the same (!) number of cells, a small shift of the cell borders can change the histogram considerably. It is thus generally preferable to use kernel density estimators instead of histograms, because they do not depend on an arbitrary placement of cell borders:

# increase n if you have a wide range of values
d <- density(as.matrix(deviation_all_genes_all_spots), n = 512)
# draw the estimated density as a curve rather than as separate points
plot(d$x, d$y, type = "l")
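To illustrate the sensitivity to cell borders mentioned above, here is a minimal sketch on synthetic data (the vector `x`, the seed, and the break positions are assumptions for illustration, not the asker's dataset). Both histograms use the same number of equal-width cells; the second set of borders is only shifted by half a cell width.

```r
# Synthetic bimodal data, for illustration only (not the asker's dataset).
set.seed(1)
x <- c(rnorm(200, mean = 0), rnorm(200, mean = 3))

# 20 cells of width 1, and the same 20 cells shifted by half a cell width.
breaks1 <- seq(-8, 12, by = 1)
breaks2 <- breaks1 + 0.5

# plot = FALSE returns the histogram object without drawing it.
h1 <- hist(x, breaks = breaks1, plot = FALSE)
h2 <- hist(x, breaks = breaks2, plot = FALSE)

# Same data, same number of cells, different counts per cell.
print(rbind(original = h1$counts, shifted = h2$counts))
```

Calling `plot(h1)` and `plot(h2)` draws the two variants side by side for a visual comparison.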

In your second and third calls to hist, you ask for the number of cells and the cell borders to be chosen automatically. In your case, this results in many more cells than in your first call with breaks=200 (where the number is only a suggestion, because hist rounds the break points to "pretty" values). You can query the cells from the return value of hist, e.g.

h <- hist(as.matrix(deviation_all_genes_all_spots))
cat(sprintf("number of cells = %i\n", length(h$mids)))
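The same query can be repeated for each rule to see how many cells it produces and which range the breaks cover. The following is a rough sketch on synthetic data with a few wide outliers (the vector `x` and the seed are assumptions for illustration, not the asker's dataset). It uses `nclass.Sturges()`, `nclass.scott()` and `nclass.FD()` for the suggested numbers of cells, and the `breaks` component of the `hist` return value for the borders actually used.

```r
# Synthetic data with a narrow core and a few wide outliers (illustration only).
set.seed(42)
x <- c(rnorm(5000), rnorm(50, sd = 50))

# Number of cells suggested by each rule.
n_cells <- c(Sturges = nclass.Sturges(x),
             Scott   = nclass.scott(x),
             FD      = nclass.FD(x))
print(n_cells)

# Cells and break range that hist() actually uses for each rule.
summary_rows <- lapply(c("Sturges", "Scott", "FD"), function(rule) {
  h <- hist(x, breaks = rule, plot = FALSE)
  data.frame(rule          = rule,
             cells         = length(h$mids),
             lowest_break  = min(h$breaks),
             highest_break = max(h$breaks),
             counted       = sum(h$counts))
})
break_summary <- do.call(rbind, summary_rows)
print(break_summary)
```

If `lowest_break` is below the smallest value and `counted` equals `length(x)`, the negative values are included in the cells; with thousands of very narrow cells they may simply be hard to see at the chosen `xlim`.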

answered Jul 04 '22 at 12:02

cdalitz


It is true that the method selects the number of cells automatically. However, the issue is why it only shows the distribution of non-negative values and ignores the negative ones. The number of cells from the code you suggested is relatively low, only 34. I have also tried the density plot, but it is hard to read the distribution from it. – MK Huda Jul 04 '22 at 19:24
Can you please make the data available, so that I can test it? The URL in the question requires a Google Drive account. – cdalitz Jul 05 '22 at 12:27
I have updated it. Hopefully, you can access it now. – MK Huda Jul 06 '22 at 12:23
