I am using the boxplot function in R 3.1.1, and I am trying to understand what is happening behind the scenes rather than fix my code.

png(file = "plot1.png")
par(mfrow= c(1,2))
par(mar = c(3,4,4,1))
boxplot(emissions ~ year, col = "blue", xlab = "Year", ylab = "Emissions", main = "Pm25 Emissions 1999 and 2008", bg = "white", ylim = c(0, 6000))
boxplot(emissions2 ~ year2, col = "blue", xlab = "Year", ylab = "Emissions", main = "Pm25 Emissions per Year", bg = "white", ylim = c(0, 6000))
dev.off()

The resulting output is: [image not available]

From what I have read, this code should in most situations produce a box-and-whisker plot, but instead it returns a linear mess of aligned dots that is no better than a bar chart. Any clues as to what I have done wrong?

Thanks. The image is not posted because I don't have 10 reputation points.

Full code to download the data set for automated, temporary processing:

url = "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip"
#######Erased to encourage the learning process...
NEI <- readRDS(mydata[2])
SCC <- readRDS(mydata[1])
year <- (NEI[,6])
emissions <-( NEI[,4])
mat <- cbind(year,emissions)
png(file = "plot1.png")
....

summary(NEI) results:

   Emissions             year
 Min.   :     0.0   Min.   :1999
 1st Qu.:     0.0   1st Qu.:2002
 Median :     0.0   Median :2005
 Mean   :     3.4   Mean   :2004
 3rd Qu.:     0.1   3rd Qu.:2008
 Max.   :646952.0   Max.   :2008

Aaron
  • And the data? Read [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for a general idea of how to provide a good example and improve your chance of getting a great answer. – Paulo E. Cardoso Sep 14 '14 at 18:06
  • The image would help. The data itself has the format | year | emission | and runs into the millions of rows. If you'd like, the script has been automated to download the data into a temporary file from a Coursera link and produce the image. – Aaron Sep 14 '14 at 18:13
  • Could you provide a summary of the data? It seems there is an issue with data distribution. – Paulo E. Cardoso Sep 14 '14 at 18:16
  • Try pasting the output of `summary()`. – Paulo E. Cardoso Sep 14 '14 at 18:20
  • Thanks Paulo. You nailed it. – Aaron Sep 14 '14 at 18:29

2 Answers

As you may have noticed, the Emissions variable in your NEI data is strongly skewed: most values are close to zero, while a few are extremely large.

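library(dplyr)
# as.tbl() is deprecated in dplyr >= 1.0.0; as_tibble() is the current equivalent
nei <- as.tbl(NEI)
# Per-year summary statistics of Emissions
nei %>%
  group_by(year) %>%
  summarise(
    min = min(Emissions),
    max = max(Emissions),
    mean = mean(Emissions),
    median = median(Emissions),
    Q25 = quantile(Emissions, probs = 0.25),
    Q75 = quantile(Emissions, probs = 0.75)
  )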

The summary:

Source: local data frame [4 x 7]

  year min       max     mean      median          Q25        Q75
1 1999   0  66696.32 6.615401 0.040000000 0.0100000000 0.25600000
2 2002   0 646951.97 3.317747 0.007164684 0.0005436423 0.08000000
3 2005   0  58896.10 3.182719 0.006741885 0.0005283287 0.07000000
4 2008   0  20799.70 1.752560 0.005273130 0.0003983980 0.06162755
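
One common way to make such skewed data readable in a boxplot is to plot it on a log scale. The following is only a minimal sketch under stated assumptions: `sim` is simulated stand-in data (not the real NEI file), drawn from a log-normal distribution so that most values are tiny and a few are huge. The real data contains zeros, which would have to be dropped or offset before taking a logarithm.

```r
# Simulated stand-in for NEI (assumption): mostly tiny values, a few huge ones.
set.seed(42)
sim <- data.frame(
  year      = rep(c(1999, 2002, 2005, 2008), each = 1000),
  Emissions = rlnorm(4000, meanlog = -4, sdlog = 3)
)

# Left panel: raw scale, where the box is squashed against zero.
# Right panel: log10 scale, where box and whiskers become visible again.
# rlnorm() is strictly positive, so log10() is safe here; real data with
# zeros would need filtering or an offset first.
par(mfrow = c(1, 2))
boxplot(Emissions ~ year, data = sim, main = "Raw scale")
boxplot(log10(Emissions) ~ year, data = sim, main = "log10 scale")
```

The transformation does not change the data, only the axis on which the quartiles are drawn, so the ordering of the medians across years is preserved.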
Paulo E. Cardoso

A boxplot is a representation of your data's distribution. More precisely, it depends on the values of your data's quantiles.

For example, if your quantiles coincide, the box and whiskers collapse into a single horizontal line, and your outliers appear as a vertical line of points.

You can easily imagine your data distributed like this example:

set.seed(1)
# 800 zeros plus 200 uniform values, randomly split into two groups
boxplot(count ~ spray,
        data = data.frame(count = c(rep(0, 800), runif(200)),
                          spray = sample(1:2, 1000, replace = TRUE)),
        col = "lightgray")
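
To see the numbers behind such a plot, `boxplot.stats()` from grDevices returns the five values used to draw the box and whiskers, along with the points treated as outliers. This is a small sketch on the pooled vector (the name `x` is introduced here for illustration; the grouping by `spray` is left out):

```r
# Same shape of data as the example: 800 zeros plus 200 values in (0, 1).
set.seed(1)
x <- c(rep(0, 800), runif(200))

s <- boxplot.stats(x)
s$stats        # lower whisker, lower hinge, median, upper hinge, upper whisker: all 0
length(s$out)  # 200: every non-zero value is drawn as an individual point
```

Because 80% of the values are zero, both hinges and the median are zero, the interquartile range is zero, and the whiskers cannot extend beyond the box. Everything above zero therefore falls outside the whiskers and is plotted as an outlier.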

agstudy