1

This problem has been brought up a million times on Stack Overflow, but I couldn't find a solution tailored to my particular problem.

I have a data frame which includes a column of species and a column of genome_names:

species                  genome_name
Acinetobacter baumannii  Acinetobacter baumanii BIDMC 56 
Acinetobacter baumannii  Acinetobacter baumannii 1032359
Klebsiella pneumoniae    Klebsiella pneumoniae CHS 30
etc...

Using this code, I created a bar plot of species in which each bar's height is the number of genome_name entries:

library(ggplot2)
ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) + 
  geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) + 
  labs(title="Number of unique strains") +
  labs(x = "Species",y="#Strains") + theme(legend.position="none") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) 

I would like to order this bar plot by increasing value of y (the number of genome_name entries). I blindly attempted to do this by converting my data to a factor, to no avail:

Error in `[<-.data.frame`(`*tmp*`, del, value = NULL) : 
missing values are not allowed in subscripted assignments of data frames
Hack-R
  • https://docs.google.com/spreadsheets/d/16oHo85Pb8PVX2VqxlqEHizn10H3jVdjRC-kDrELcOfs/edit?usp=sharing – Daniel Harris Aug 19 '16 at 16:47
  • Have you attempted the solutions to [this question](http://stackoverflow.com/questions/5208679/order-bars-in-ggplot2-bar-graph)? – aosmith Aug 19 '16 at 16:59
  • Here is the exact code copied: ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) + geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) +labs(title="Number of unique strains") +labs(x = "Species",y="#Strains") + theme(legend.position="none") + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) – Daniel Harris Aug 19 '16 at 16:59
  • @aosmith yes, that is the factor attempt. I probably am not understanding how to apply his answer to my problem. – Daniel Harris Aug 19 '16 at 17:00
  • @DanielHarris Did it seem to run for a long time or did it throw the error right away? I am running the code from your comment and it seems to just be running for a long time – Hack-R Aug 19 '16 at 17:02
  • @Hack-R It is a very long run time. – Daniel Harris Aug 19 '16 at 17:03
  • I'd like to order the species before running ggplot. That way we don't have to run the data for long periods of time. – Daniel Harris Aug 19 '16 at 17:05
  • Good idea! You can order it like this `PATRIC_genomes_AMR_2_ris_subset <- PATRIC_genomes_AMR_2_ris_subset[order(PATRIC_genomes_AMR_2_ris_subset$species),]` – Hack-R Aug 19 '16 at 17:08
  • @Hack-R Thanks for your time and help! I'll use the code in your answer. It's fantastic. – Daniel Harris Aug 19 '16 at 17:22
  • Happy to help :) Cheers – Hack-R Aug 19 '16 at 17:22

3 Answers

1

Reorder the factor levels before plotting:

df$species <- reorder(df$species, df$genome_name)

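Edit: My bad for not looking at the data more closely. Since genome_name is not numeric, the one-liner above does not give a meaningful order. The code below counts the unique strains per species and sorts the bars by that count.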

library(dplyr)
library(ggplot2)

df %>%
  group_by(species) %>%
  summarise(unique_strains = length(unique(genome_name))) %>%
  mutate(species = reorder(species, unique_strains)) %>%
  ggplot(aes(species, unique_strains)) + geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) + 
  xlab(NULL) +
  scale_y_log10()
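
For anyone who wants the same ordering without dplyr, here is a minimal base R sketch. It assumes a small made-up data frame (the strain labels K1, A1, etc. are hypothetical) and a helper I've called n_unique; reorder() is given an explicit FUN so that it counts distinct strains instead of taking a mean.

```r
# Toy data standing in for the real table (values are made up for illustration).
df <- data.frame(
  species = c("Klebsiella pneumoniae",
              "Acinetobacter baumannii", "Acinetobacter baumannii",
              "Acinetobacter baumannii",
              "Escherichia coli", "Escherichia coli", "Escherichia coli"),
  genome_name = c("K1", "A1", "A2", "A3", "E1", "E2", "E2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct values in a vector.
n_unique <- function(x) length(unique(x))

# reorder() applies FUN to genome_name within each species and sorts the
# factor levels by the result, in increasing order.
df$species <- reorder(factor(df$species), df$genome_name, FUN = n_unique)

# Levels now run from the species with the fewest unique strains to the most.
levels(df$species)
```

Because ggplot2 draws a discrete x axis in level order, plotting df after this step should give bars in increasing order of unique strains.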
Tyler Moss
1
library(ggplot2)
PATRIC_genomes_AMR_2_ris_subset <- read.csv("genomes_subset.csv", header = TRUE)
PATRIC_genomes_AMR_2_ris_subset <- dplyr::sample_n(PATRIC_genomes_AMR_2_ris_subset, 300)

PATRIC_genomes_AMR_2_ris_subset <- PATRIC_genomes_AMR_2_ris_subset[order(PATRIC_genomes_AMR_2_ris_subset$species),]


# Order by genome_name (stored in a new Position column; the plot below does not use it)
PATRIC_genomes_AMR_2_ris_subset <- within(PATRIC_genomes_AMR_2_ris_subset, 
                   Position     <- factor(genome_name, 
                                      levels=names(sort(table(genome_name), 
                                                        decreasing=TRUE))))


ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) + 
  geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) + 
  labs(title="Number of unique strains") +
  labs(x = "Species",y="#Strains") + theme(legend.position="none") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) 

# Order by species
PATRIC_genomes_AMR_2_ris_subset <- within(PATRIC_genomes_AMR_2_ris_subset, 
                                          species <- factor(species, 
                                                         levels=names(sort(table(species), 
                                                         decreasing=TRUE))))

ggplot(PATRIC_genomes_AMR_2_ris_subset,aes(x=species,fill=genome_name)) + 
  geom_bar(colour="black") + scale_colour_continuous(guide = FALSE) + 
  labs(title="Number of unique strains") +
  labs(x = "Species",y="#Strains") + theme(legend.position="none") + 
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5)) 


This is pretty much the same as the existing question on ordering bars in a ggplot2 bar graph, but you mentioned ordering by the fill value, genome_name, which is a little different, and we also got to see how the ordering affects the run time, so it's not a duplicate.
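
Note that the code above sorts with decreasing=TRUE, while the question asks for increasing order. A minimal sketch of the increasing version, assuming a made-up toy data frame (the strain labels are hypothetical) and guarding the plot so it only runs if ggplot2 is installed:

```r
# Toy data standing in for the real table (values are made up for illustration).
df <- data.frame(
  species = c("Klebsiella pneumoniae",
              "Acinetobacter baumannii", "Acinetobacter baumannii",
              "Acinetobacter baumannii",
              "Escherichia coli", "Escherichia coli"),
  genome_name = c("K1", "A1", "A2", "A3", "E1", "E2"),
  stringsAsFactors = FALSE
)

# Rows per species; sort() without decreasing = TRUE sorts in increasing order.
counts <- sort(table(df$species))

# Use the sorted names as the factor levels so the x axis follows that order.
df$species <- factor(df$species, levels = names(counts))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = species)) +
    geom_bar() +
    labs(x = "Species", y = "#Strains") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
  # print(p) draws the plot, with the shortest bar first.
}
```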

Community
Hack-R
  • I am getting an error after that first block of code (the factor creation): Error in `[<-.data.frame`(`*tmp*`, del, value = NULL) : missing values are not allowed in subscripted assignments of data frames – Daniel Harris Aug 19 '16 at 17:26
  • @DanielHarris I didn't get that message but it's probably because I sampled the data. What you can do is either (a) use `complete.cases` and exclude the missing values or (b) use either `mean` or `RRF::na.roughfix` to impute the missing values if they are numeric. For missing factor variables either use `addNA` or you can impute them by mode (or exclude the rows with missing values). – Hack-R Aug 19 '16 at 17:29
  • @DanielHarris Here's a little function I wrote to automatically impute missing factor values: `MaxTable <- function(InVec, mult = FALSE) { if (!is.factor(InVec)) InVec <- factor(InVec) A <- tabulate(InVec) if (isTRUE(mult)) { levels(InVec)[A == max(A)] } else levels(InVec)[which.max(A)] }` – Hack-R Aug 19 '16 at 17:31
  • Is there any way to change the NULLs to NA? I'm not sure I understand the complete.cases solution. – Daniel Harris Aug 19 '16 at 17:34
  • Cool let me try that. – Daniel Harris Aug 19 '16 at 17:34
  • Where do I put my data frame or value names in your function? I'm sorry I'm a little confused. – Daniel Harris Aug 19 '16 at 17:36
  • @DanielHarris If the NULLs are encoded in a way that `is.null` recognizes then you can do `df$var[is.null(df$var)] <- NA` if that doesn't work then you can use `gsub`. BTW if you don't mind could you mark this answer as the solution? :) – Hack-R Aug 19 '16 at 17:36
  • @DanielHarris `df$my_factor[is.na(df$my_factor)] <- MaxTable(df$my_factor)` – Hack-R Aug 19 '16 at 17:37
  • is.null(PATRIC_genomes_AMR_2_ris_subset) returns false. But I'm checking that wrong? – Daniel Harris Aug 19 '16 at 17:48
  • @DanielHarris Yea you can't use that function on an entire data set in that way. It's just telling you that the entire data set isn't NULL. I would recommend to do this `df$var[is.null(df$var)] <- NA` . You can probably do it against all columns at once by using `apply`, `sapply`, a loop, etc. If you want to check a column do this `table(is.null(df$my_col))` – Hack-R Aug 19 '16 at 17:55
  • I applied PATRIC_genomes_AMR_2_ris_subset <-!is.null(PATRIC_genomes_AMR_2_ris_subset) and now the first block of code (in your answer) is returning: Error in UseMethod("within") : no applicable method for 'within' applied to an object of class "logical" – Daniel Harris Aug 19 '16 at 17:56
  • @DanielHarris That's not how `is.null` works ;) so the result you're getting is just a `TRUE` or `FALSE` not your data. I think you were trying to do `my_data <- my_data[!is.null(my_data),]` Just remember that `is.anything` will always return a logical value, so you use it for indexing or counting but by itself it doesn't return data. – Hack-R Aug 19 '16 at 17:57
  • Actually could I just make an sapply function that sums the number of genome_name per species in a vector. Then use sort(vector, decreasing = TRUE) on that df? – Daniel Harris Aug 19 '16 at 18:23
  • @DanielHarris Sure, I'm sure you could. Just remember to use `na.rm=T` in the sum / aggregation function when you try it. – Hack-R Aug 19 '16 at 18:38
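
A note on the missing-value discussion in the comments above: is.null() tests an object as a whole and returns a single TRUE or FALSE, so it cannot pick out individual missing entries; is.na() and complete.cases() work element-wise and row-wise. The sketch below assumes, purely for illustration, that the missing entries were imported as the literal text "NULL" in a made-up toy data frame. Whether it resolves the error reported above depends on the real data.

```r
# Toy data (made up): missing entries appear as the text "NULL" or as NA.
df <- data.frame(
  species = c("Klebsiella pneumoniae", "NULL",
              "Escherichia coli", "Escherichia coli"),
  genome_name = c("K1", "A1", "NULL", NA),
  stringsAsFactors = FALSE
)

# is.null() looks at the whole column, not its elements: a single FALSE.
is.null(df$species)

# Assumption: "NULL" strings mark missing data. Replace them with NA,
# column by column; %in% never returns NA, so existing NAs are safe.
df[] <- lapply(df, function(col) {
  col[col %in% "NULL"] <- NA
  col
})

# Count missing values per column.
na_per_column <- colSums(is.na(df))

# Keep only the rows with no missing values in any column.
clean <- df[complete.cases(df), ]
```

The factor-ordering code can then be run on clean instead of the original data frame.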
0

To order the bars, set species to a factor with the levels sorted by occurrences.

Plotting is taking so long because you're actually drawing a bar for every pair of species and genome_name that occurs (12,339 of them, to be precise), and stacking the bars by species. If you just want black bars, take out the fill aesthetic; ggplot can then aggregate much more quickly, as it only draws one bar per species:

library(ggplot2)

# download data (requires the gsheet package)
df <- gsheet::gsheet2tbl('https://docs.google.com/spreadsheets/d/16oHo85Pb8PVX2VqxlqEHizn10H3jVdjRC-kDrELcOfs/edit#gid=1638547987')

ggplot(df, aes(x = factor(species, names(sort(-table(species)))))) + 
    geom_bar(colour = "black") + 
    labs(title = "Number of unique strains") +
    labs(x = "Species", y = "#Strains") + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 

plot with black bars
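
To get a feel for how many rectangles each version asks ggplot to draw, the combinations can be counted directly without plotting. A minimal sketch on made-up toy data (the strain labels are hypothetical); on the real data, the pair count should correspond to the number of species and genome_name pairs quoted above.

```r
# Toy data standing in for the real table (values are made up for illustration).
df <- data.frame(
  species = c("Klebsiella pneumoniae",
              "Acinetobacter baumannii", "Acinetobacter baumannii",
              "Acinetobacter baumannii",
              "Escherichia coli", "Escherichia coli", "Escherichia coli"),
  genome_name = c("K1", "A1", "A2", "A3", "E1", "E2", "E2"),
  stringsAsFactors = FALSE
)

# With fill = genome_name, one stacked segment per distinct pair.
n_pairs <- nrow(unique(df[c("species", "genome_name")]))

# Without a fill aesthetic, one bar per species.
n_species <- length(unique(df$species))

cat("segments with fill:", n_pairs, "\n")
cat("bars without fill: ", n_species, "\n")
```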

If you add a fill aesthetic with the same approach, you'll still get only black bars: colour = "black" in geom_bar puts a black stroke around each stacked segment, and because the segments are so small, the strokes cover the fill color. One way to avoid the issue is to simply take out colour = "black":

ggplot(df, aes(x = factor(species, names(sort(-table(species)))), fill = genome_name)) + 
    geom_bar() + 
    labs(title = "Number of unique strains") +
    labs(x = "Species", y = "#Strains") + 
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 

plot with colored bars

If you really want a black stroke on each stacked bar, you'll need to set size to something small enough that the fill is not covered by the stroke (in ggplot2 3.4.0 and later, the same setting is called linewidth):

ggplot(df, aes(x = factor(species, names(sort(-table(species)))), fill = genome_name)) + 
    geom_bar(colour = "black", size = 0.01) + 
    labs(title = "Number of unique strains") +
    labs(x = "Species", y = "#Strains") + 
    theme(legend.position = "none",
          axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) 

plot with colored bars with black stroke

alistaire