
(see image in link for better explanation)

I am trying to plot a boxplot with a log-scaled y-axis. I am very new to R and have tried to read tutorials, but they all seem to use a different plotting function.

1/ How can I change the y-axis labels (i.e. to 0.001, 0.01, 0.1, 1 etc.) whilst retaining the log scale?

2/ How can I overlay a scatter plot of the data on top of the boxes?

3/ Finally, how can I add gridlines and a border of a chosen weight and colour, as well as axis titles?

So far, the only code I have used is:

boxplot(box,
        varwidth = TRUE, log = "y", las = 1)

Sorry it's so obvious but thanks guys!

Reproducible example (first 30 data points):

structure(list(CD = c(0.291998350286, 58.4266839332, 1.27227891359, 
7.05106388302, 0.000175203165079, 14.5665189804, 0.991317477169, 
1.56817217741, 30.4733699427, 0.421737157934, 1.42372160368, 
0.333712081068, 0.126643859356, 0.339337851064, 0.151788605996, 
3.81711532569, 1.54344215823, 17.2540240816, 3.67548135199, 4.08331544672, 
0.0549081111653, 0.0734888395127, 5.16751927204, 22.6971132167, 
1.04321972985, 0.184343635879, 2.29291935133, 0.0555342051937, 
0.411328596454, 51.3157360015), WD = c(0.402162969955, 0.189544929529, 
0.000840280055822, 0.0501429051167, 3.4853343866, 0.0286017538011, 
0.0121948073037, 0.992426638872, 0.0192559537415, 0.00398698494632, 
0.888543226817, 0.703331842713, 0.378008558951, 4.70639786908, 
0.113706495683, 1.32546254378, 0.936899368015, 0.108969215053, 
0.25593198462, 0.564518000036, 0.121389166752, 0.195884521759, 
0.704964462359, 1.25602965005, 0.0242662609253, 2.11883481514, 
0.44581781826, 0.659586439033, 0.36869665263, 0.824802234027), 
    MC = c(0.0817800846374, 1.70562818122, 0.0807325401412, 0.180484111266, 
    0.0438908620273, 8.75617400342, 0.479370274286, 0.908307567192, 
    2.81446961622, 0.0699990348088, 0.0491805903311, 0.00573142245572, 
    0.116352754956, 0.311847695137, 0.0414215549125, 0.104499713126, 
    0.0551723673287, 0.076199002014, 0.191940770942, 4.11745930602, 
    1.75751348869, 0.0517694407553, 2.29459310871, 0.0269233884783, 
    0.097992042257, 11.7325079183, 0.262543381616, 0.748125397347, 
    0.635821595694, 0.794256126423), WC = c(0.0686062258206, 
    0.514240129693, 7.68226019254, 4.36776848419, 0.618214352027, 
    2.13911888244, 0.0392505689889, 0.0823059942863, 2.36466448826, 
    0.0688590035687, 0.151457824484, 0.260629997743, 8.30460664472, 
    0.235838508742, 0.41960151168, 4.38818043685, 0.0797918590848, 
    0.109025596179, 0.0837286212892, 0.0117251770506, 1.17739717792, 
    0.207413909376, 8.62180088733, 2.33021344099, 0.166981061366, 
    1.13410263425, 0.0905601584251, 0.154075808752, 0.140498581833, 
    0.213863468391), MWC = c(301.891645135, 0.672405306137, 0.105110378336, 
    5.36947765018, 0.672138277335, 3.58296467263, 10.7754596083, 
    5.01795685162, 0.0775842457366, 1.07683084271, 1.0360624974, 
    16.8763517534, 0.390002867544, 1.50618637339, 0.371973397842, 
    1.28366689573, 0.0633246500391, 0.0364964802158, 0.249895194073, 
    0.0379084221473, 0.0798275709535, 0.504735639066, 8.12262202509, 
    82.5787360252, 0.068574731873, 8.76779568117, 0.00873932360562, 
    0.0142029221366, 0.0228083224849, 0.146073745479)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -30L))
Ben
  • Welcome to Stack Overflow. You'll get better answers if you [make this question reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) by including a small representative dataset in a plain text format - for example the output from `dput(box)`, if that is not too large. It's not clear what is meant by "something more manageable whilst retaining log scale". Using `ggplot2` the functions you need to learn are `geom_boxplot()`, `geom_jitter()`, `labs()` and `theme()`. – neilfws Jul 25 '23 at 01:15
  • @neilfws Thanks for the comment, appreciate it! I have incorporated your suggestions. "Manageable" just meant an alternative for anyone unfamiliar with logs. Thank you. I struggled to find threads that explained how to use them for an absolute beginner. – Ben Jul 25 '23 at 01:58
  • Thanks for the data: I think there's a missing parenthesis somewhere and a couple of missing values, can you try again and carefully copy-paste the output from `dput`. – neilfws Jul 25 '23 at 02:04
  • @neilfws Apologies, the dataset was thousands of rows long so I tried to snip it manually. I have cut it to 30 rows and re-run `dput(box)` - hopefully no errors this time? – Ben Jul 25 '23 at 02:10

1 Answer

Lots of questions in one here, which really boil down to "how to use ggplot2". Here's a good introductory guide.

First, your data are in "wide" format; ggplot2 works better with "long" format (one column for data names, one for their values). We can use tidyr::pivot_longer() for that. By default it generates two new columns called name and value.
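As a minimal sketch of that reshaping step, the snippet below uses a hypothetical three-row stand-in for `box` (the name `box_small` and the rounded values are assumptions for illustration, not the full dataset):

```r
library(tidyr)

# Hypothetical stand-in for `box`: first three CD and WD values, rounded
box_small <- data.frame(
  CD = c(0.292, 58.4, 1.27),
  WD = c(0.402, 0.190, 0.00084)
)

# everything() selects all columns; with the default arguments the former
# column names go into "name" and the measurements into "value"
box_long <- pivot_longer(box_small, everything())
box_long
```

Each original row contributes one long-format row per column, so the 3 x 2 input becomes 6 rows.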

For a boxplot we use geom_boxplot(). By "scatter plot" I think you mean "jitter plot", which is the usual way to overlay individual data points on a boxplot. The appropriate function is geom_jitter().

Labels for y-axis values can be altered in several different ways. One is to use functions from the scales package. Another is to supply a labelling function - see the code below.
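To get tick marks at exactly 0.001, 0.01, 0.1, 1 and so on, one option is to pass explicit breaks alongside a labelling function. This is only a sketch: `plain_labels` and the `toy` data frame are hypothetical names made up for the illustration.

```r
library(ggplot2)

# Hypothetical helper: format each break on its own, so that 0.01 is
# printed as "0.01" rather than padded to match the smallest break
plain_labels <- function(x) vapply(x, format, character(1), scientific = FALSE)

# Hypothetical toy data spanning several orders of magnitude
toy <- data.frame(
  name  = rep(c("CD", "WD"), each = 4),
  value = c(0.002, 0.3, 7, 58, 0.0008, 0.05, 0.4, 3.5)
)

# breaks are given in data units; the axis itself stays log10-scaled
p <- ggplot(toy, aes(name, value)) +
  geom_boxplot() +
  scale_y_log10(breaks = 10^(-4:2), labels = plain_labels)
```

Printing `p` draws the plot; breaks that fall outside the range of the data are simply not shown.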

Axis titles can be added using the labs() function.

Gridlines and border of chosen weight and color: well, it depends what you want exactly, but in general you would use theme() and look for arguments related to panel. In the example code below we add a thick red border.
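For the gridlines specifically, a possible starting point is sketched here. It assumes ggplot2 3.4.0 or later, where line thickness is set with `linewidth`; the colours, widths and the `toy` data frame are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical toy data, only there to make the sketch self-contained
toy <- data.frame(
  name  = rep(c("CD", "WD"), each = 4),
  value = c(0.002, 0.3, 7, 58, 0.0008, 0.05, 0.4, 3.5)
)

p <- ggplot(toy, aes(name, value)) +
  geom_boxplot() +
  scale_y_log10() +
  theme_bw() +
  theme(
    # major gridlines: darker and thicker
    panel.grid.major = element_line(colour = "grey60", linewidth = 0.6),
    # minor gridlines: lighter and thinner
    panel.grid.minor = element_line(colour = "grey85", linewidth = 0.3),
    # border around the plotting panel
    panel.border = element_rect(fill = NA, colour = "black", linewidth = 1)
  )
```

The arguments panel.grid.major.y and panel.grid.major.x (and their minor counterparts) style horizontal and vertical gridlines separately.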

So putting all of that together:

library(ggplot2)
library(tidyr)
library(dplyr)

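# Reshape to long format, then draw boxplots with the raw points jittered on top
box %>%
  pivot_longer(everything()) %>%
  ggplot(aes(name, value)) +
  # Hide boxplot outliers; geom_jitter() already draws every point
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2) +
  # Log10 y-axis with plain decimal labels instead of scientific notation
  scale_y_log10(labels = function(x) format(x, scientific = FALSE)) +
  theme_bw() +
  # Thick red border; ggplot2 >= 3.4.0 uses linewidth (older versions: size)
  theme(panel.border = element_rect(fill = NA, color = "red", linewidth = 2)) +
  labs(x = "Group", y = "Value")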

Result. Hope that helps you to get started.


neilfws
  • Thank you so much! It is great to have working code that I can adapt, really appreciate it! Two further questions if that's ok: 1/ Can the jittered points be ordered chronologically / by order of appearance from left to right, for example? Or is that not possible? 2/ I also have a question on equally spaced square root y-scales on box plots; I can't seem to make the sqrt y-labels equally spaced (as on log graphs) whilst ensuring the data values update to correspond. Would you mind taking a look at the code I have so far? Thanks so much again! – Ben Jul 25 '23 at 06:16
  • If you mean order the points rather than the categories? No, jittering is random. The aim is just to show the number of data points and their distribution. The second question, you should probably post as a new question. – neilfws Jul 25 '23 at 10:50
  • I essentially meant both: retain the box plots with the main categories (CD, WD etc.) on x, but within each category plot the overlaid jitter/scatter points left to right chronologically, so a time series aspect is retained. It sounds like that is not possible from what you have said? Question 2 I have somehow managed to solve after a lot of fiddling, thank you though! Not sure if it is worth posting it with the solution in case someone else is interested? – Ben Jul 27 '23 at 00:11