
I'm making a boxplot to compare two groups of sequences, those that begin with the 5 characters "aaaaa" and those that begin with "ttttt", out of 7 characters in total. My data looks like this:

  • Trait: G1 G2 G3 G4 ...
  • pUUEP9.16_Seq: aaaaaaa aaaaaat ttttttt tttttta...
  • RPPUE: 1.43 1.55 1.62 1.74 ...

But when I plot with ggplot2, I get a lot of boxplots because there are many different sequences. When I plot with the base boxplot function, I can use grepl("^aaaaa", pUUEP9.16_Seq) to merge all sequences whose first 5 characters are the same, regardless of the remaining characters, and do the same with ttttt to compare them. But how can I do this in ggplot2? Thank you very much!
My code is like this:

library(ggplot2)
library(tidyverse)
All_Related_Traits_Haplotype %>%
        # refer to columns by name inside aes(); no need for data$column
        ggplot(aes(pUUEP9.16_Seq, RPPUE, fill = pUUEP9.16_Seq)) +
        geom_boxplot(outlier.shape = NA, alpha = 0.5) +
        # legend.position = "none" hides the legend
        theme(legend.position = "none", plot.title = element_blank(), axis.title = element_blank(),
              title = element_blank(), axis.text = element_text(size = 15)) +
        scale_fill_brewer(palette = "Set1")

pogibas
    ggplot is not for data manipulation. Do enough preprocessing to create factor variables to do the grouping before reaching for ggplot. And if you need help in doing that, first search SO and google for examples, and then if not successful post example data. – IRTFM Apr 28 '20 at 04:56

1 Answer


Welcome to SO!

Example dataset

You have not posted a dataset, so I created one which mimics your data with the general structure as follows:

  sequences         y
1   aaaaaaa 0.3692316
2   aaaaaaa 0.3344723
3   aaaaaaa 0.3364015
4   aaaaaaa 0.2718506
5   aaaaaaa 0.5482466
6   aaaaaaa 0.2147532

Here df$sequences contains sequences that start with either "aaaa" or "tttt", each followed by one of 7 different three-character endings, for a total of 14 unique sequences. df$y holds the y values, created via rnorm(), with a different mean for each unique sequence. Each unique sequence is repeated 10 times, for a total of 140 observations. The full example dataset is available via the code at the bottom of this answer.
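
For reference, a dataset with this shape could be simulated roughly as follows. This is only a sketch: the seed, the per-sequence means, and the standard deviation are assumptions chosen for illustration, not the values behind the numbers shown above, and the name sim is made up so it does not clash with df.

```r
set.seed(42)  # arbitrary seed (assumption) so the sketch is repeatable

# The 14 unique sequences: two 4-character prefixes x seven 3-character endings
prefixes <- c("aaaa", "tttt")
endings  <- c("aaa", "aat", "ata", "att", "tat", "tta", "ttt")
unique_seqs <- as.vector(outer(prefixes, endings, paste0))

# Hypothetical mean for each unique sequence
seq_means <- runif(length(unique_seqs), min = 0.1, max = 1)
n_rep <- 10  # repetitions per unique sequence

sim <- data.frame(
  sequences = factor(rep(unique_seqs, each = n_rep)),
  y = rnorm(length(unique_seqs) * n_rep,
            mean = rep(seq_means, each = n_rep),
            sd = 0.1)  # hypothetical spread
)

str(sim)
```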

Basic Boxplots

You mention that you can show individual boxplots but want to analyze the data by grouping the sequences according to their "starting" sequence; here we'll use the first 4 characters. We'll get to that, but first, here is the boxplot for the 14 individual unique sequences in the example dataset:

ggplot(df, aes(sequences, y)) +
    geom_boxplot() + coord_flip() + theme_bw()


I'm flipping the coordinates so the boxes run horizontally and the sequence labels are easier to read. Now, to address your question: how do you analyze the sequences by grouping those starting with "aaaa" and "tttt" in ggplot? The answer is that you should do the grouping outside of ggplot, then pass the prepared dataset to ggplot for plotting.

When preparing your data, it's very important that you do not remove or alter anything in the dataset: just add ways of categorizing and labeling the rows so that you can group them as needed within ggplot or elsewhere. This matters because you mention wanting to "merge" the data that starts with "aaaa", and so on. The key is to "label" the data instead, because labeling does not alter the data, whereas "merging" ends up removing data via simplification. For labeling, we'll add another column to the example dataset called "seq.cat", which holds the first 4 characters of each element of df$sequences, extracted with the substr function. I then coerce this column into a factor for easy grouping with ggplot later:

df$seq.cat <- substr(df$sequences, 1,4)
df$seq.cat <- factor(df$seq.cat)
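
The same labeling carries over to the 5-character prefixes in the question. The sketch below uses a small made-up data frame (toy, with values invented for illustration) that borrows the question's column names. It shows two options: substr() labels every row by its first 5 characters, while the grepl() pattern mentioned in the question labels only the two prefixes of interest. The column names prefix5 and group are hypothetical.

```r
# Toy stand-in for All_Related_Traits_Haplotype (values invented for illustration)
toy <- data.frame(
  pUUEP9.16_Seq = c("aaaaaaa", "aaaaaat", "ttttttt", "tttttta", "aaaatta"),
  RPPUE = c(1.43, 1.55, 1.62, 1.74, 1.50),
  stringsAsFactors = FALSE
)

# Option 1: label every row with its first 5 characters
toy$prefix5 <- substr(toy$pUUEP9.16_Seq, 1, 5)

# Option 2: label only the two prefixes of interest; everything else becomes NA
toy$group <- ifelse(grepl("^aaaaa", toy$pUUEP9.16_Seq), "aaaaa",
             ifelse(grepl("^ttttt", toy$pUUEP9.16_Seq), "ttttt", NA))
toy$group <- factor(toy$group)

toy
```

Either way the original pUUEP9.16_Seq and RPPUE columns are left untouched; rows whose label is NA (or an unwanted prefix) can simply be left out at plotting time.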

Plotting by Categories

Now, we can use ggplot to create boxplots for the entire category instead of individual sequences:

ggplot(df, aes(seq.cat, y)) +
    geom_boxplot() + coord_flip() + theme_bw()


Cool, but what about seeing the individual sequences in there, grouped by their "seq.cat" value? To do that, we can rely on "dodging", which is controlled by the position= argument of geom_boxplot (its default, "dodge2", already dodges). "Dodging" is the process by which ggplot positions geoms according to the x aesthetic but splits them side by side according to another aesthetic. Some geoms can work out which other aesthetic to use for dodging, but often you specify it with the group= aesthetic. Here, we'll show the same boxplots according to df$seq.cat, but "dodge" them according to their df$sequences value:

ggplot(df, aes(seq.cat, y)) +
    geom_boxplot(aes(group=sequences, fill=seq.cat)) +
    coord_flip() + theme_bw()


You can even get creative and show the overall boxplot "under" the individual sequence boxplots if you want. This is mostly to showcase the kinds of things you can do in ggplot, and to show that geoms are drawn in the order they are added to the ggplot() call, so later layers sit on top of earlier ones:

ggplot(df, aes(seq.cat, y)) +
    geom_boxplot(aes(fill=seq.cat), alpha=0.2) +
    geom_boxplot(aes(group=sequences, fill=seq.cat)) +
    coord_flip() + theme_bw()


Hope that's enough for you to go on and answer your question. Note that the data needs to be preprocessed before plotting; again, be sure to preserve the unaltered data and only add columns that give you different ways to group it (which does not change your actual data).
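
As a rough sketch of how this might look with the column names from the question: the data frame below is made up (toy, with RPPUE values drawn at random), and the prefix column and plot_data object are hypothetical names. It assumes ggplot2 is installed and that only the "aaaaa" and "ttttt" groups should appear in the plot.

```r
library(ggplot2)

# Made-up stand-in for All_Related_Traits_Haplotype (values invented for illustration)
set.seed(1)
toy <- data.frame(
  pUUEP9.16_Seq = rep(c("aaaaaaa", "aaaaaat", "ttttttt", "tttttta", "aaaatta"), each = 6),
  stringsAsFactors = FALSE
)
toy$RPPUE <- rnorm(nrow(toy), mean = 1.6, sd = 0.1)

# Label each row by its first 5 characters; the original columns stay untouched
toy$prefix <- substr(toy$pUUEP9.16_Seq, 1, 5)

# Keep only the two prefixes of interest for this particular plot
plot_data <- subset(toy, prefix %in% c("aaaaa", "ttttt"))
plot_data$prefix <- factor(plot_data$prefix)

# One box per prefix instead of one box per unique sequence
p <- ggplot(plot_data, aes(prefix, RPPUE, fill = prefix)) +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "none")
print(p)
```

Filtering into a separate plot_data object keeps toy itself complete, so other groupings remain possible later.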

(ASIDE) In the future, please try to post an example dataset so that your question includes a minimal reproducible example. Typically, this means posting your code (which you did), but also an example of the output (image) when possible, and especially a data frame that goes along with your code. The data frame can be your data, a subset of your data, or another example dataset that illustrates your question. If you run dput(your.data.frame) in the console, the output can be pasted right into your question so that SO'ers can use the same data frame.
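
A minimal sketch of that dput() round trip, using a small made-up data frame built from the values shown in the question (the pairing of sequences and values is an assumption for illustration):

```r
# Made-up data frame standing in for your own data
your.data.frame <- data.frame(
  pUUEP9.16_Seq = c("aaaaaaa", "aaaaaat", "ttttttt", "tttttta"),
  RPPUE = c(1.43, 1.55, 1.62, 1.74),
  stringsAsFactors = FALSE
)

# Print a text representation that can be pasted into a question;
# head() keeps the output short when the data frame is large
dput(head(your.data.frame, 20))

# Anyone can rebuild the data frame by evaluating the pasted text
pasted  <- paste(capture.output(dput(your.data.frame)), collapse = "\n")
rebuilt <- eval(parse(text = pasted))
all.equal(your.data.frame, rebuilt)
```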

Example Dataset used:

structure(list(sequences = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L, 
12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 
13L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 
14L, 14L), .Label = c("aaaaaaa", "aaaaaat", "aaaaata", "aaaaatt", 
"aaaatat", "aaaatta", "aaaattt", "ttttaaa", "ttttaat", "ttttata", 
"ttttatt", "tttttat", "tttttta", "ttttttt"), class = "factor"), 
    y = c(0.369231635284751, 0.334472301615182, 0.336401541613538, 
    0.271850593628982, 0.548246553920441, 0.214753232428961, 
    0.272931216681757, 0.389963934540836, 0.161901134711628, 
    0.287687044414413, 0.559412668094905, 0.511042084622797, 
    0.597587755059323, 0.500017456043711, 0.561501596281245, 
    0.458014405582298, 0.808128498550588, 0.66375054115016, 0.599075775610392, 
    0.443971210496118, 0.592652837891229, 0.700878174322006, 
    0.805834612463509, 0.73297398303544, 0.587026084652387, 0.72117372267475, 
    0.73706613409661, 0.72568140022593, 0.679518188568345, 0.75535253075742, 
    0.721895352310081, 0.528850684559142, 0.650718664272936, 
    0.734761143215842, 0.594052856201981, 0.604613813151693, 
    0.639617263646129, 0.615627312141315, 0.697516133026403, 
    0.546807852963315, 0.0419542045093033, 0.0530867427743355, 
    -0.0295728194080514, 0.186503734298246, 0.137057493792742, 
    0.0889750527784836, 0.0645351842041302, 0.0410801985961975, 
    0.0544020482151214, 0.0095233043656552, 1.00261072509423, 
    1.0061820771478, 0.93384301837102, 0.862063619357913, 0.958852725861385, 
    0.886465265255992, 1.08645738913725, 0.938979900458647, 0.909544023142776, 
    1.04298470598537, 0.18097909664628, 0.188127809291404, 0.369707831198092, 
    0.194019386606537, 0.231499528208205, 0.278539551245994, 
    0.244508784125559, 0.225239703704488, 0.316376968568025, 
    0.306888135885525, 1.08321647786231, 0.741250605755319, 0.920826657383399, 
    0.909790347962312, 0.857890202217418, 0.834109623218618, 
    0.821643124416019, 0.878543781969851, 0.770070846649379, 
    0.758540936804054, 0.814560830961658, 1.18351784603039, 0.941264140150978, 
    0.864331965800611, 1.0516487843867, 0.987107258914654, 0.896314482641831, 
    0.825699584991788, 1.04032881969714, 0.921906270597259, 0.286468632253135, 
    0.270909870912734, 0.27180172080904, 0.227302339317363, 0.20278286085882, 
    0.235233019656869, 0.409427334942824, 0.103457260357185, 
    0.374943122272895, 0.169149938998089, 0.139420655744854, 
    0.152160895214924, 0.173794178787149, 0.409061439157534, 
    0.202183092752316, 0.329908116944302, 0.079264916788022, 
    0.11868462962438, 0.113313604373663, 0.235918265868379, 0.472409229186149, 
    0.332292533095422, 0.314459451306904, 0.531725824796639, 
    0.401415131485931, 0.673040222849771, 0.511928186282114, 
    0.379449838394305, 0.406302903005807, 0.330168688299693, 
    0.697001028015928, 0.240944888047631, 0.36862679632926, 0.656175495866837, 
    0.74006385762291, 0.425231842730887, 0.644456396087279, 0.368047727818937, 
    0.652041334699297, 0.318438638976521, 0.693903486329515, 
    0.916059702358207, 0.837186565483507, 0.731343897682531, 
    0.737129367978127, 0.816520705268809, 0.660761720816765, 
    0.799788442176542, 0.619028474247718, 0.76733836467068), 
    seq.cat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("aaaa", 
    "tttt"), class = "factor")), row.names = c(NA, -140L), class = "data.frame")
chemdork123