R Setting Y Axis to Count Distinct in ggplot2

Question

I have a data frame that contains 4 variables: an ID number (chr), a degree type (factor w/ 2 levels of Grad and Undergrad), a degree year (chr with year), and Employment Record Type (factor w/ 6 levels).

I would like to display this data as a count of the unique ID numbers by year, drawn as a stacked area plot of the 6 Employment Record Types. So: degree year on the x-axis, the number of distinct IDs for that year on the y-axis, and the fill showing the Record Type. I am using ggplot2 in RStudio.

I used the following code, but the y axis does not count distinct IDs:

ggplot(AlumJobStatusCopy, aes(x=Degree.Year, y=Entity.ID,
       fill=Employment.Data.Type)) + geom_freqpoly() +
       scale_fill_brewer(palette="Blues",
       breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))

I also tried changing y = Entity.ID to y = ..count.., and that did not work either. I have searched for solutions, since the problem seems to lie in how I am writing the aes() call.

I also tried the following code based on examples of similar plots:

ggplot(AlumJobStatusCopy, aes(interval)) + 
      geom_area(aes(x=Degree.Year, y = Entity.ID, 
                    fill = Employment.Data.Type)) +
      scale_fill_brewer(palette="Blues",
          breaks=rev(levels(AlumJobStatusCopy$Employment.Data.Type)))

This does not even seem to work. I've read the documentation and am at my wit's end.

EDIT:

After figuring out the answer to the problem, I realized that I was not actually using the correct values for my Year variable. A count tells me nothing, as I am trying to display the rise in missing records and the decline in current records.

My Dataset:
Year, int, 1960-2015
Current Record, num: % of total records that are current
No Record, num: % of total records that are not current

Ergo, each Year value has two corresponding percent values. I am now using 2 lines instead of an area plot, since the y-axis now shows the values themselves rather than a count, but I would still like the area under the curves filled. I tried using melt() to convert the data from wide to long, but was still unable to fill under both lines. Filling is just for aesthetic purposes, as I would like to use a gradient for each, with one fill slightly lighter than the other.

Here is my current code:

ggplot(Alum, aes(Year)) + 
    geom_line(aes(y = Percent.Records, colour = "Percent.Records")) +
    geom_line(aes(y = Percent.No.Records, colour = "Percent.No.Records")) + 
    scale_y_continuous(labels = percent) + ylab('Percent of Total Records') + 
    ggtitle("Active, Living Alumni Employment Record") +
    scale_x_continuous(breaks=seq(1960, 2014, by=5))

I cannot post an image yet.

Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we can test the code and possible solutions. This will make it easier to help you, — MrFlick, Jul 31 '15 at 22:57
I can't test it, because you didn't provide any data, but try: `ggplot(AlumJobStatusCopy, aes(x=Degree.Year, y=Entity.ID, fill=Employment.Data.Type)) + geom_bar(stat="identity")` That should give you a stacked bar. — RHA, Aug 01 '15 at 08:31
@RHA, that code produces bars with one slice for each ID, colored by employment type. I think s/he wants a count of IDs by employment type. To get there with `y=Entity.ID`, I think you'd have to order the IDs by employment type. — ulfelder, Aug 01 '15 at 11:52

ulfelder · Answer 1 · 2015-08-01T11:38:43.910

I think you're missing a step where you summarize the data to get the quantities to plot on the y-axis. Here's an example with some toy data similar to how you describe yours:

# Make toy data with three levels of employment type
set.seed(1)
df <- data.frame(Entity.ID = rep(LETTERS[1:10], 3), Degree.Year = rep(seq(1990, 1992), each=10),
    Degree.Type = sample(c("grad", "undergrad"), 30, replace=TRUE),
    Employment.Data.Type = sample(as.character(1:3), 30, replace=TRUE))

# Here's the part you're missing, where you summarize for plotting 
library(dplyr)
dfsum <- df %>%
    group_by(Degree.Year, Employment.Data.Type) %>%
    tally()

# Now plot that, using the counts (column n) as your y values
library(ggplot2)
ggplot(dfsum, aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
    geom_bar(stat="identity") + labs(fill="Employment")

The result could use some fine-tuning, but I think it's what you mean. Here, the bars are equal height because each year in the toy data includes an equal number of IDs; if the count of IDs varied, so would the total bar height.
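One caveat: tally() counts rows, so it matches the number of distinct IDs only when each ID appears at most once per year and employment type, as it does in the toy data. If your real data can repeat an ID within a group, a sketch along these lines, using dplyr's n_distinct(), would count unique IDs instead. The dup data frame and the column names n_rows and n_ids are made up for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: ID "A" appears twice in 1990 with the same type
dup <- data.frame(
  Entity.ID            = c("A", "A", "B", "C", "A", "B"),
  Degree.Year          = c(1990, 1990, 1990, 1990, 1991, 1991),
  Employment.Data.Type = c("1", "1", "2", "2", "1", "1")
)

# Compare the row count with the distinct-ID count in each group
dup_sum <- dup %>%
  group_by(Degree.Year, Employment.Data.Type) %>%
  summarise(n_rows = n(),
            n_ids  = n_distinct(Entity.ID),
            .groups = "drop")

dup_sum
```

In the 1990 / type "1" group the two columns differ, because the repeated ID is counted once by n_distinct() and twice by n(). Use whichever column fits your meaning of "count" as the y value.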

If you don't want to add objects to your workspace, just do the counting inside the call to ggplot():

ggplot(tally(group_by(df, Degree.Year, Employment.Data.Type)),
    aes(x = Degree.Year, y = n, fill = Employment.Data.Type)) +
    geom_bar(stat="identity") + labs(fill="Employment")
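Since the question asks for area plots rather than bars, here is a rough sketch of both variants it mentions. The first part swaps geom_bar() for geom_area() on the summarized toy data, and uses tidyr's complete() to fill in zero counts so every year has a value for every type. The second part is aimed at the two percent series described in the question's edit: the Alum values below are invented stand-ins (I don't have the real data), reshaped with pivot_longer() in place of melt(), and drawn with position = "identity" so the two filled areas overlap instead of stacking. Object names such as area_counts and alum_long are my own.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Part 1: stacked area of distinct ID counts (same toy data as above)
set.seed(1)
df <- data.frame(
  Entity.ID            = rep(LETTERS[1:10], 3),
  Degree.Year          = rep(seq(1990, 1992), each = 10),
  Employment.Data.Type = sample(as.character(1:3), 30, replace = TRUE)
)

area_counts <- df %>%
  group_by(Degree.Year, Employment.Data.Type) %>%
  summarise(n_ids = n_distinct(Entity.ID), .groups = "drop") %>%
  # make sure every year/type pair exists, with 0 where there were no IDs
  complete(Degree.Year, Employment.Data.Type, fill = list(n_ids = 0L))

p_stacked <- ggplot(area_counts,
                    aes(x = Degree.Year, y = n_ids, fill = Employment.Data.Type)) +
  geom_area() +                        # stacks by default
  scale_fill_brewer(palette = "Blues") +
  labs(y = "Distinct IDs", fill = "Employment")

# Part 2: fill under two percent lines
# Invented values only; replace with the real Alum data frame
Alum <- data.frame(Year = 1960:2015)
Alum$Percent.Records    <- seq(0.8, 0.3, length.out = nrow(Alum))
Alum$Percent.No.Records <- 1 - Alum$Percent.Records

alum_long <- pivot_longer(Alum,
                          cols = c(Percent.Records, Percent.No.Records),
                          names_to = "Series", values_to = "Percent")

p_filled <- ggplot(alum_long,
                   aes(x = Year, y = Percent, fill = Series, colour = Series)) +
  geom_area(position = "identity", alpha = 0.4) +  # overlap, don't stack
  scale_y_continuous(labels = scales::percent) +
  scale_x_continuous(breaks = seq(1960, 2015, by = 5)) +
  ylab("Percent of Total Records")
```

Print p_stacked or p_filled to draw them. The alpha value keeps the overlapping fill readable; choosing two close shades with scale_fill_manual() is one way to get the lighter/darker look you describe.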
