I decided to go through my statistics courses, which are taught in SPSS, but do the exercises in R, as I would like to learn to do stats there. I am currently making histograms for two continuous numerical variables, data$alcohol (alcohol misuse scale score) and data$age, but got stuck on the first one.

The main issues are:

  1. My histogram looks different from the picture in the answer sheet
  2. I cannot add a normal curve unless I change the aes to density, which I do not want to do, as the exercise asks for frequency

Here is what I wrote:

data <- read_excel("~/Dropbox/My Mac (jmbp.local)/Desktop/Kings College London/2021:2022/Statistics/Week 1 stats/cleandata.xlsx")

mean_alc <- mean(data$alcohol)
sd_alc <- sd(data$alcohol) 

p <- ggplot(data= data) + 
  geom_histogram(mapping = aes(x = alcohol, y=..count..),
                 breaks=seq(0, 20, by=1), 
                 col="black", 
                 fill="white", 
                 alpha = 1) + 
  labs(title="Alcohol Misuse Score", x="Alcohol Misuse Score", y="Frequency") + 
  xlim(c(0,20)) + 
  ylim(c(0,20)) +
  stat_function(fun = dnorm, colour = "red", args=list(mean = mean(data$alcohol), sd = sd(data$alcohol))) +
  plot(density(data$alcohol, bw = 0.05))
p

My histogram looks like this:

My histogram

The picture included in the solutions (done in SPSS) looks like this:

Histogram in the answer sheet

My first question is: why do the bars in my histogram look different from the ones in the answer sheet? Is there some fundamental difference between how SPSS and R draw histograms? Secondly, is there a way to add a normal curve to a frequency histogram in ggplot2? For reference, this is how it looks when done in SPSS:

Frequency histogram with normal curve in SPSS

data$alcohol has the following values:

alcohol = c(15.78121, 17, 17.61943, 17.61943, 14.67395, 17.61943, 17, 17, 13.72413, 13.72664, 17, 15.86039, 17, 15.78121, 11.48049, 14.61672, 12.73437, 8, 17, 15.86039, 14.59133, 15.78121, 14.61672, 17, 17, 18, 15.78121, 10, 14.67395, 9, 7.033369, 17, 17, 15.86039, 15.78121, 18, 13.07577, 18, 8, 17.61943, 15.86039, 11.53364, 11.4323, 18, 6.390277, 17, 14.59133, 18, 14.9238, 15.78121, 14.61672, 17, 17.61943, 14.67395, 8, 18, 8, 17.61943, 14.4069, 6.477451, 7.02489, 18, 18, 13.09201, 15.78121, 14.59133, 18, 5.451102, 9, 4.801972, 15.86039, 15.86039, 17, 17, 17)
Julie
  • Could you add some reproducible data (see here: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? Maybe you can use dput(), i.e. run for example dput(data %>% select(alcohol)) and copy the output in here :) – T. C. Nobel Oct 02 '21 at 14:18
  • @TrineCosmusNobel c(15.78121, 17, 17.61943, 17.61943, 14.67395, 17.61943, 17, 17, 13.72413, 13.72664, 17, 15.86039, 17, 15.78121, 11.48049, 14.61672, 12.73437, 8, 17, 15.86039, 14.59133, 15.78121, 14.61672, 17, 17, 18, 15.78121, 10, 14.67395, 9, 7.033369, 17, 17, 15.86039, 15.78121, 18, 13.07577, 18, 8, 17.61943, 15.86039, 11.53364, 11.4323, 18, 6.390277, 17, 14.59133, 18, 14.9238, 15.78121, 14.61672, 17, 17.61943, 14.67395, 8, 18, 8, 17.61943, 14.4069, 6.477451, 7.02489, 18, 18, 13.09201, 15.78121, 14.59133, 18, 5.451102, 9, 4.801972, 15.86039, 15.86039, 17, 17, 17) – Julie Oct 02 '21 at 14:27

1 Answer

Here is a way you can at least get the normal curve as in SPSS, inspired by an answer in this thread that might also help you. It rescales the density curve, as proposed in one of the answers there.

I tried loading the data in SPSS and making the histogram there, and I can see the difference. It seems to revolve around how SPSS and ggplot place the bins. In R, you set the bin edges with the breaks argument. In SPSS they are not specified, so SPSS determines them in another way. You can perhaps read about this here.

However, the “issue” seems to be with the whole-number (.0) values, i.e. with which side of each bin is closed. In the R script the bins are closed on the right, so 18 ends up in the same bin as 17.61943, namely (17, 18], while 17 lands in (16, 17], one bin above 15.78121 and 15.86039. In SPSS, 17 is binned together with 17.62. In other words, it comes down to the break points.
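To make the break-point effect concrete, here is a minimal base R sketch. The vector x is a hand-picked subset of the posted scores (an assumption made for illustration, not the full data), and the left-closed variant only mimics the pattern described for SPSS above; it is not a claim about SPSS's actual algorithm.

```r
# Illustrative subset of the alcohol scores that sit on or near bin edges
x <- c(15.78121, 17, 17.61943, 18)
breaks <- seq(0, 20, by = 1)

# Right-closed bins (a, b]: the default convention in geom_histogram()
right_closed <- cut(x, breaks = breaks, right = TRUE)

# Left-closed bins [a, b): 17 now shares a bin with 17.61943
left_closed <- cut(x, breaks = breaks, right = FALSE)

print(data.frame(x, right_closed, left_closed))
```

If the left-closed behaviour is what you want in the plot, geom_histogram() takes a closed argument, so adding closed = "left" next to breaks should switch the convention.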

library(ggplot2)

ggplot(data = data) + 
  geom_histogram(mapping = aes(x = alcohol),
                 breaks = seq(0, 20, by = 1), 
                 col = "black", 
                 fill = "white", 
                 alpha = 1) + 
  labs(title = "Alcohol Misuse Score", x = "Alcohol Misuse Score", y = "Frequency") + 
  xlim(c(0, 20)) + 
  ylim(c(0, 20)) +
  # Rescale the density to counts: density * bin width * n.
  # The breaks above give a bin width of 1, so the factor is 1 (not 0.5).
  stat_function(fun = function(x) 
    dnorm(x, mean = mean(data$alcohol), sd = sd(data$alcohol)) * 1 * sum(!is.na(data$alcohol)), colour = "red")
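As a rough check on why the density is multiplied by the bin width and the number of observations, the sketch below compares the exact expected count per bin under a normal distribution with the rescaled density at the bin midpoints. The values of n, mu and s are made up for illustration and are not computed from the posted data.

```r
# Hypothetical values, for illustration only
n  <- 75     # number of observations
w  <- 1      # bin width, matching breaks = seq(0, 20, by = 1)
mu <- 14.3   # assumed mean
s  <- 3.8    # assumed standard deviation

breaks <- seq(0, 20, by = w)
mids   <- head(breaks, -1) + w / 2

# Exact expected count in each bin under the normal model
exact_counts <- n * diff(pnorm(breaks, mean = mu, sd = s))

# Rescaled density evaluated at the bin midpoints
scaled_density <- n * w * dnorm(mids, mean = mu, sd = s)

# The two agree closely, so the rescaled curve sits on the count scale
print(max(abs(exact_counts - scaled_density)))
```

With a different bin width, only w changes; the same factor then goes into the stat_function() call.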


T. C. Nobel