Hello and welcome to SO!
Example dataset
You have not posted a dataset, so I created one which mimics your data with the general structure as follows:
sequences y
1 aaaaaaa 0.3692316
2 aaaaaaa 0.3344723
3 aaaaaaa 0.3364015
4 aaaaaaa 0.2718506
5 aaaaaaa 0.5482466
6 aaaaaaa 0.2147532
Here, df$sequences contains sequences starting with either "aaaa" or "tttt", with 7 different "end sequences" of three characters appended to each, for a total of 14 unique sequences. df$y holds the y values, created via rnorm(), where each unique sequence has a different mean value. There are 10 repetitions of each unique sequence, for a total of 140 observations. The full example dataset is available via the code at the bottom of this answer.
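For reference, a dataset with this shape could be generated roughly as follows. This is only a sketch of the approach described above: the seed, the per-sequence means, the standard deviation, and the name df_sim are all assumptions for illustration, so the values will not match the data at the bottom of this answer.

```r
# Sketch: build a data frame shaped like the example (values will differ).
set.seed(42)  # arbitrary seed, assumed here only for reproducibility

starts <- c("aaaa", "tttt")
ends   <- c("aaa", "aat", "ata", "att", "tat", "tta", "ttt")  # 7 endings

# All 14 start/end combinations, e.g. "aaaaaaa", "ttttaat", ...
seqs <- sort(as.vector(outer(starts, ends, paste0)))

# One assumed mean per unique sequence; the spread (sd) is also an assumption
seq_means <- runif(length(seqs), min = 0.1, max = 1)

df_sim <- data.frame(
  sequences = factor(rep(seqs, each = 10), levels = seqs),
  y = rnorm(length(seqs) * 10, mean = rep(seq_means, each = 10), sd = 0.1)
)

str(df_sim)
```

Each sequence is repeated 10 times with rep(..., each = 10), and the matching mean is repeated the same way so that every observation is drawn around its own sequence's mean.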
Basic Boxplots
You mention you can show individual boxplots, but want to analyze by grouping the sequences according to their "starting" sequence, which here we'll say is the first 4 characters. We'll get to that, but the first thing to do is show the example boxplot for the 14 individual unique sequences in the example dataset:
library(ggplot2)

ggplot(df, aes(sequences, y)) +
  geom_boxplot() + coord_flip() + theme_bw()

I'm flipping the axes so the sequences are easier to read. Now, to address your question: how do you analyze the sequences by grouping those starting with "aaaa" and "tttt" in ggplot? The short answer is that you should do the grouping outside of ggplot, then pass the prepared dataset to ggplot for plotting.
When preparing your data, it's very important that you do not remove or alter the existing data: just add ways of categorizing and labeling it so that you can group as needed within ggplot or elsewhere. This matters because you mention wanting to "merge" the data that starts with "aaaa" together, etc. The key is to "label" the data instead, because labeling does not alter the data, whereas "merging" ends up removing data via simplification. For labeling, we'll add another column to the example dataset called "seq.cat", which holds the first 4 characters of each element of df$sequences, extracted with the substr function. I then coerce this column into a factor for easy grouping with ggplot later:
df$seq.cat <- substr(df$sequences, 1,4)
df$seq.cat <- factor(df$seq.cat)
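To make the "label, don't merge" point concrete, here is a small check on made-up data. The toy data frame and the name before are assumptions for illustration only; the idea is simply that adding a label column leaves every original value and row in place.

```r
# Toy data standing in for the real data frame (values are made up)
toy <- data.frame(
  sequences = c("aaaaaaa", "aaaaaat", "ttttaaa", "ttttaat"),
  y = c(0.37, 0.56, 0.29, 0.14)
)
before <- toy  # copy of the untouched data

# as.character() makes the conversion explicit in case 'sequences' is a factor
toy$seq.cat <- factor(substr(as.character(toy$sequences), 1, 4))

# The original columns and row count are unchanged; only a label was added
stopifnot(
  identical(toy$sequences, before$sequences),
  identical(toy$y, before$y),
  nrow(toy) == nrow(before)
)

table(toy$seq.cat)  # number of observations per category
```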
Plotting by Categories
Now, we can use ggplot to create boxplots for each entire category instead of individual sequences:
ggplot(df, aes(seq.cat, y)) +
geom_boxplot() + coord_flip() + theme_bw()

Cool, but what about seeing the individual sequences in there, grouped by their "seq.cat" value? To do that, we can rely on the position= argument of geom_boxplot, whose default already "dodges" the boxes. "Dodging" is the process by which ggplot places geoms according to one x aesthetic, but splits them side by side according to another aesthetic. Certain geoms can figure out which other aesthetic to use for dodging, but often you specify it with the group= aesthetic. Here, we'll show the same boxplots according to df$seq.cat, but "dodge" them according to their df$sequences value:
ggplot(df, aes(seq.cat, y)) +
geom_boxplot(aes(group=sequences, fill=seq.cat)) +
coord_flip() + theme_bw()
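If it helps to see the dodging spelled out, the same kind of plot can be written with the position= argument given explicitly. This is a sketch on made-up data (the demo data frame is an assumption for illustration); position = "dodge2" is, as far as I know, what geom_boxplot uses by default in recent ggplot2 versions, so it should draw the same thing as leaving it out.

```r
library(ggplot2)

# Small made-up data: 2 categories x 2 sequences each x 5 observations
set.seed(1)
demo <- data.frame(
  seq.cat   = rep(c("aaaa", "tttt"), each = 10),
  sequences = rep(c("aaaaaaa", "aaaaaat", "ttttaaa", "ttttaat"), each = 5),
  y         = rnorm(20)
)

p <- ggplot(demo, aes(seq.cat, y)) +
  geom_boxplot(
    aes(group = sequences, fill = seq.cat),
    position = "dodge2"  # dodging written out explicitly
  ) +
  coord_flip() + theme_bw()

p
```

One box is drawn per value of the group= aesthetic, and boxes that share the same seq.cat position are placed side by side.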

You can even get creative and show the overall boxplot "under" the individual sequence boxplots if you want. This is mostly to showcase the kinds of things you can do in ggplot, and to show how the order of geoms in the ggplot() call results in layers of geoms drawn on top of one another:
ggplot(df, aes(seq.cat, y)) +
geom_boxplot(aes(fill=seq.cat), alpha=0.2) +
geom_boxplot(aes(group=sequences, fill=seq.cat)) +
coord_flip() + theme_bw()

Hope that's enough for you to go on to answer your question. Note that you need to do preprocessing of the data beforehand - again being sure to preserve the unaltered data and only provide different ways that you can group the data (which does not change your actual data).
(ASIDE)
In the future, please try to post an example dataset so that your question is a minimal reproducible example. Typically, this means posting your code (which you did), but also an example of the output (image) when possible, and especially a data frame that goes along with your code. The data frame can be your data, a subset of your data, or another example dataset that illustrates your question. If you run dput(your.data.frame) in the console, the output can be pasted right into your question to allow SO'ers to use the same data frame.
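As a quick illustration of why dput() output is so handy, here is a sketch of the round trip on a tiny made-up data frame (the names small, txt, and rebuilt are assumptions for illustration). The text that dput() prints is itself R code that rebuilds the object.

```r
# Tiny made-up data frame standing in for your real data
small <- data.frame(
  sequences = factor(c("aaaaaaa", "aaaaaat", "ttttaaa")),
  y = c(0.37, 0.56, 0.29)
)

# dput() prints R code that rebuilds the object; capture it as text here
txt <- capture.output(dput(small))
cat(txt, sep = "\n")

# Evaluating that text gives the same data frame back, which is what
# a reader does when pasting it after `df <-` in their own session
rebuilt <- eval(parse(text = txt))
all.equal(small, rebuilt)
```

For a large data frame, something like dput(head(your.data.frame, 20)) keeps the pasted output short while still giving others something to run.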
Example Dataset used:
df <- structure(list(sequences = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
14L, 14L), .Label = c("aaaaaaa", "aaaaaat", "aaaaata", "aaaaatt",
"aaaatat", "aaaatta", "aaaattt", "ttttaaa", "ttttaat", "ttttata",
"ttttatt", "tttttat", "tttttta", "ttttttt"), class = "factor"),
y = c(0.369231635284751, 0.334472301615182, 0.336401541613538,
0.271850593628982, 0.548246553920441, 0.214753232428961,
0.272931216681757, 0.389963934540836, 0.161901134711628,
0.287687044414413, 0.559412668094905, 0.511042084622797,
0.597587755059323, 0.500017456043711, 0.561501596281245,
0.458014405582298, 0.808128498550588, 0.66375054115016, 0.599075775610392,
0.443971210496118, 0.592652837891229, 0.700878174322006,
0.805834612463509, 0.73297398303544, 0.587026084652387, 0.72117372267475,
0.73706613409661, 0.72568140022593, 0.679518188568345, 0.75535253075742,
0.721895352310081, 0.528850684559142, 0.650718664272936,
0.734761143215842, 0.594052856201981, 0.604613813151693,
0.639617263646129, 0.615627312141315, 0.697516133026403,
0.546807852963315, 0.0419542045093033, 0.0530867427743355,
-0.0295728194080514, 0.186503734298246, 0.137057493792742,
0.0889750527784836, 0.0645351842041302, 0.0410801985961975,
0.0544020482151214, 0.0095233043656552, 1.00261072509423,
1.0061820771478, 0.93384301837102, 0.862063619357913, 0.958852725861385,
0.886465265255992, 1.08645738913725, 0.938979900458647, 0.909544023142776,
1.04298470598537, 0.18097909664628, 0.188127809291404, 0.369707831198092,
0.194019386606537, 0.231499528208205, 0.278539551245994,
0.244508784125559, 0.225239703704488, 0.316376968568025,
0.306888135885525, 1.08321647786231, 0.741250605755319, 0.920826657383399,
0.909790347962312, 0.857890202217418, 0.834109623218618,
0.821643124416019, 0.878543781969851, 0.770070846649379,
0.758540936804054, 0.814560830961658, 1.18351784603039, 0.941264140150978,
0.864331965800611, 1.0516487843867, 0.987107258914654, 0.896314482641831,
0.825699584991788, 1.04032881969714, 0.921906270597259, 0.286468632253135,
0.270909870912734, 0.27180172080904, 0.227302339317363, 0.20278286085882,
0.235233019656869, 0.409427334942824, 0.103457260357185,
0.374943122272895, 0.169149938998089, 0.139420655744854,
0.152160895214924, 0.173794178787149, 0.409061439157534,
0.202183092752316, 0.329908116944302, 0.079264916788022,
0.11868462962438, 0.113313604373663, 0.235918265868379, 0.472409229186149,
0.332292533095422, 0.314459451306904, 0.531725824796639,
0.401415131485931, 0.673040222849771, 0.511928186282114,
0.379449838394305, 0.406302903005807, 0.330168688299693,
0.697001028015928, 0.240944888047631, 0.36862679632926, 0.656175495866837,
0.74006385762291, 0.425231842730887, 0.644456396087279, 0.368047727818937,
0.652041334699297, 0.318438638976521, 0.693903486329515,
0.916059702358207, 0.837186565483507, 0.731343897682531,
0.737129367978127, 0.816520705268809, 0.660761720816765,
0.799788442176542, 0.619028474247718, 0.76733836467068),
seq.cat = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("aaaa",
"tttt"), class = "factor")), row.names = c(NA, -140L), class = "data.frame")