     distance1  grey1    distance2  grey2
1    0.0000000 -300.364  0.0000000 -135.219
2    0.2174741 -296.963  0.2114969 -132.601
3    0.4349482 -292.887  0.4229937 -131.959
4    0.6520882 -290.310  0.6341657 -133.514
5    0.8695623 -285.777  0.8456625 -127.111
6    1.0870364 -279.921  1.0571594 -116.404
7    1.3045105 -274.418  1.2686562 -116.850
8    1.5216505 -272.005  1.4798282 -115.464
9    1.7391246 -273.666  1.6913251 -102.823
10   1.9565987 -270.381  1.9028219 -101.497
11   2.1740728 -270.273  2.1143188  -98.245
12   2.3912128 -270.705  2.3254907  -98.474

My x axis is distance, which I normalised to 0-100. My y axis is the intensity value along that distance. I have two samples, and each y value matches a specific x value (note that sample 2 has more rows than sample 1). I have pasted the first few rows of my data as an example. How can I make one plot with both samples depicted on the same plot? And after that, how can I create an average plot of the two samples?

  • It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Jun 03 '22 at 17:47
  • This is not a reproducible example, and it is not easy to replicate in code. You can use `dput()` - or a minimal example - and paste the output by editing your post, not as a comment. – RobertoT Jun 03 '22 at 18:11

1 Answer

It's not really clear what you're going for, so I'll provide a couple of demonstrations. Up front, I'm assuming that you have two distinct datasets here: the first in columns 1-2, the second in columns 3-4. This can be done literally in ggplot2 with:

library(ggplot2)
ggplot(dat) +
  geom_line(aes(distance, grey1), color="red") +
  geom_line(aes(distance1, grey2), color="blue")

[plot: basic brute-force ggplot]

But this approach is brute-forcing it a bit, and it will make things like legends and color control rather painful. I suggest this process would benefit from reshaping the data into a long format, with just the x and y variables plus one column indicating the group each row came from. For example:

library(data.table)
newdat <- data.table::melt(as.data.table(dat),
    measure = patterns("^distance","^grey"),
    value.name = c("distance", "grey"))
newdat
#     variable  distance     grey
#       <fctr>     <num>    <num>
#  1:        1 0.0000000 -300.364
#  2:        1 0.2174741 -296.963
#  3:        1 0.4349482 -292.887
#  4:        1 0.6520882 -290.310
#  5:        1 0.8695623 -285.777
#  6:        1 1.0870364 -279.921
#  7:        1 1.3045105 -274.418
#  8:        1 1.5216505 -272.005
#  9:        1 1.7391246 -273.666
# 10:        1 1.9565987 -270.381
# ---                            
# 15:        2 0.4229937 -131.959
# 16:        2 0.6341657 -133.514
# 17:        2 0.8456625 -127.111
# 18:        2 1.0571594 -116.404
# 19:        2 1.2686562 -116.850
# 20:        2 1.4798282 -115.464
# 21:        2 1.6913251 -102.823
# 22:        2 1.9028219 -101.497
# 23:        2 2.1143188  -98.245
# 24:        2 2.3254907  -98.474

Here the new `variable` column indicates which column-group each row came from.

Here, the plotting in ggplot becomes a bit simpler:

ggplot(newdat, aes(distance, grey)) +
  geom_line(aes(color = variable, group = variable))

[plot: better ggplot, with legend]

Notice that we now have a legend, and it is handling colors itself. These can be overridden, but that's a different topic (and addressed in numerous questions here on SO).
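For instance, a minimal sketch of overriding the colors with `scale_color_manual` (the toy data frame and the hex/named colors here are illustrative, not from the question; the level names "1" and "2" mirror what `melt` produces):

```r
library(ggplot2)

# illustrative stand-in for the long-format data produced by melt()
toy <- data.frame(
  variable = factor(rep(c(1, 2), each = 3)),
  distance = c(0, 1, 2, 0, 1, 2),
  grey     = c(-300, -280, -270, -135, -120, -98)
)

# manual colors keyed by the factor levels
p <- ggplot(toy, aes(distance, grey, color = variable)) +
  geom_line() +
  scale_color_manual(values = c("1" = "firebrick", "2" = "steelblue"))
p
```

The keys passed to `values` must match the factor levels exactly, or the lines are dropped from the plot.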


As for the "average plot of the 2 samples", this needs a bit more context about the data, and the question as asked doesn't provide enough. My biggest concern is that the distance values of the two groups are not aligned. That is, if distance had a value of exactly 1.000 in both, then I think we could safely average the grey values of those two observations. However, that is not the case in general (and in this sample dataset the two groups share only the starting distance of 0).
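A quick check (using the first few distance values copied from the data at the end of the answer) shows the only exact overlap between the two distance columns is the starting 0:

```r
# first few distance values from each group, copied from the sample data
dist_grp1 <- c(0, 0.2174741, 0.4349482)
dist_grp2 <- c(0, 0.2114969, 0.4229937)

# exact x values present in both groups
intersect(dist_grp1, dist_grp2)
# [1] 0
```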

If you really want to find a form of average, I suggest you interpolate both lines onto a known domain of distance and show the average. I'll demo what I mean.

First, I'll add points so that we can see the x-wise misalignment:

ggplot(newdat, aes(distance, grey, color = variable)) +
  geom_line() +
  geom_point()

[plot: same ggplot with added points, showing the x-wise misalignment]

Now, let's compute the "average" (from interpolated distance values) and add it to the original long-form data.

# common grid of distances covering the shorter of the two x ranges
newdist <- seq(0, min(max(dat$distance), max(dat$distance1)), by = 0.1)
# interpolate each group onto that grid, then average the two
# interpolated grey values at each grid distance
newdat2 <- newdat[, setNames(approx(distance, grey, xout = newdist), c("distance", "grey")), by = variable
  ][, .(variable = "Avg", grey = mean(grey)), by = distance]
newdat2 <- rbindlist(list(newdat, newdat2), use.names = TRUE)

Now, we can use the same plot command and get the third line:

ggplot(newdat2, aes(distance, grey, color = variable)) +
  geom_line() +
  geom_point()

[plot: final plot with average line and dots]

This method makes some inferences about the data, and we don't have much context here in the question. I think this is a safe step, but make sure it makes sense statistically before blindly applying this technique to your data.


Data (I started writing this before the columns were renamed, so follow-on code may need to be adjusted).

dat <- structure(list(distance = c(0, 0.2174741, 0.4349482, 0.6520882, 0.8695623, 1.0870364, 1.3045105, 1.5216505, 1.7391246, 1.9565987, 2.1740728, 2.3912128), grey1 = c(-300.364, -296.963, -292.887, -290.31, -285.777, -279.921, -274.418, -272.005, -273.666, -270.381, -270.273, -270.705), distance1 = c(0, 0.2114969, 0.4229937, 0.6341657, 0.8456625, 1.0571594, 1.2686562, 1.4798282, 1.6913251, 1.9028219, 2.1143188, 2.3254907), grey2 = c(-135.219, -132.601, -131.959, -133.514, -127.111, -116.404, -116.85,  -115.464, -102.823, -101.497, -98.245, -98.474)), class = "data.frame", row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
r2evans
  • I think I understand what you mean ... and this is why it is important and helpful to have representative data. Assuming that you know about 15 instances of `NA` in one of the groups of columns, then this is expected/normal. What you see there is a *warning*, so it can be safely ignored (or even suppressed if you really need to). To be clear, though, I *discourage* the first code block that does not reshape the data. It is sloppy-ggplot code, and if you want to do much else with the plots, most answers on SO will start with "stop doing it that way" and revert to a long-format dataset. – r2evans Jun 03 '22 at 19:34
  • (That comment was made in response to a since-deleted comment about `geom_line` removing NA values in the data.) – r2evans Jun 03 '22 at 19:35
  • Thank you so much for your useful help! I have tried your reshaping method and it works really well. The final thing it would be nice to have along with the average line is the standard deviation. Would you perhaps know how to add that to your R script above? Thank you again. – Ruthilia Vera Jun 04 '22 at 16:46
  • I cannot upload an image to show more, I believe, because I am new and apparently need a higher reputation on the site, so that is a bit restricting. – Ruthilia Vera Jun 04 '22 at 16:48
  • One way would be to repeat the `newdat2 <- ...` process, perhaps naming it `newdat3` or `newdat_sd` or something meaningful, replacing `mean(..)` with `sd(..)`. After that, again do `newdat <- rbindlist(list(newdat, newdat_sd))` and then plot. I don't know if `sd` is very meaningful, though, since each interpolated `distance` will have just two values; some value, yes, lots of value? not certain. Perhaps your real data is more varied than this? – r2evans Jun 05 '22 at 00:26
  • Thanks @r2evans. I have tried the code you suggest, but it gives me a different result from another snippet I tried that seems more logical. Do you have any ideas on that? I also managed to post as an answer a full example of mine with 3 variables whose x axes are normalised 0-100. Along with the average line, I wish to plot the SD, calculated from the same x values as the average line (meaning a common x value for all the variables). Is that possible? Thank you a lot in advance! – Ruthilia Vera Jun 05 '22 at 19:38
  • This question appears to be spiraling a bit, Ruthilia Vera. The more you show me, the more I think it's important for you to reshape/pivot into a long format. It will take a learning curve for you to think in that fashion. Once you do that, learn how to use `dplyr::group_by`, `data.table`'s `by=`, or base's `ave`, `aggregate`, `tapply`, or other similar functions for calculating something by-group. See https://stackoverflow.com/q/11562656/3358272 for a lot of discussion in that direction. – r2evans Jun 05 '22 at 21:37
  • Thank you @r2evans. Yes, I agree. I will have a look at your recommendations. – Ruthilia Vera Jun 05 '22 at 22:02
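The standard-deviation idea discussed in the comments can be sketched as follows. This is a minimal, self-contained illustration: the tiny `newdat` here is a made-up stand-in for the long-format data in the answer (two groups with slightly misaligned x values), and the ribbon shows mean ± sd of the two interpolated curves:

```r
library(data.table)
library(ggplot2)

# tiny made-up stand-in for the long-format data (not the question's values)
newdat <- data.table(
  variable = rep(c("1", "2"), each = 4),
  distance = c(0, 0.5, 1.0, 1.5,   0, 0.4, 0.9, 1.4),
  grey     = c(-300, -290, -280, -272,   -135, -132, -120, -115)
)

# common grid inside both groups' x ranges
newdist <- seq(0, 1.3, by = 0.1)

# interpolate each group onto the grid, then mean and sd per distance
interp <- newdat[, setNames(approx(distance, grey, xout = newdist),
                            c("distance", "grey")), by = variable]
stats <- interp[, .(avg = mean(grey), sdev = sd(grey)), by = distance]

p <- ggplot(stats, aes(distance, avg)) +
  geom_ribbon(aes(ymin = avg - sdev, ymax = avg + sdev), alpha = 0.2) +
  geom_line()
p
```

As noted in the comments, with only two groups the sd at each grid point is computed from just two values, so interpret the ribbon cautiously.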