
I have a data manipulation/table joining question that I have been hitting my head against for a couple of days. I am trying to create plots using ggplot2 that color the data by factors.

The simple way to do this is by using:

ggplot(data, aes(X,Y)) + 
geom_point(aes(color = Factor_A))

This means I need a table that has a column for X, Y, and Factor_A.

However, my factor data and my xy data are in two completely different forms. My factor data is neatly framed like so:

factor_data <-data.frame(
  Sample_ID = c(1:12),
  Factor_A = sample(letters[3:6],12,replace=TRUE),
  Factor_B = sample(letters[7:8],12,replace=TRUE)
)

Sample_ID, Factor_A, and Factor_B each have their own vertical column. So far, great for plotting.

However, my X,Y data is framed like so:

xy_data <-data.frame(
  X = c((1:80)/10),
  "1" = rnorm(80),
  "2" = rnorm(80),
  "3" = rnorm(80),
  "4" = rnorm(80),
  "5" = rnorm(80),
  "6" = rnorm(80),
  "7" = rnorm(80),
  "8" = rnorm(80),
  "9" = rnorm(80),
  "10" = rnorm(80),
  "11" = rnorm(80),
  "12" = rnorm(80),
  check.names = FALSE
)

In this case, each sample ID is horizontally across the top row. Each Sample_ID has its own X,Y spectrum (with the same X values shared across all of them). I am trying to plot all the spectra (in this simplified example, 12 spectra) at once and color each line by one of the Factors.

Does anyone have an idea as to how to join these tables together to make it possible to plot the X,Y data using ggplot2 and still be able to color the plotted lines using Factor_A or Factor_B?

M. L.
  • Can you add a small example of each dataset to your question? You can see some ideas on how to do this [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – aosmith Jul 18 '18 at 19:41
  • Thank you, I added code for my first dataset to the question. My second dataset is 8010x313 (and it's important to this question because the 313 columns correspond to the 312 rows of the first dataset) so I'm not sure how to produce it in scratch from R. I wish I could attach a csv. – M. L. Jul 19 '18 at 15:58
  • We definitely don't need the whole dataset, just a small piece of it or a small fake dataset that looks similar. How about 6 rows and 4 columns of the xy dataset and the corresponding 4 rows of the other dataset? The answer to the question I linked to gives ideas for making *minimal* examples. – aosmith Jul 19 '18 at 16:03
  • I see, that makes sense. I simplified the example, and added the code for a second data frame. Thank you for the suggestion. – M. L. Jul 19 '18 at 17:20
  • Great, seeing the dataset makes things much clearer! (You can add `check.names = FALSE` to your `data.frame()` code to get numbers as headers). This now looks to me like a "reshaping" problem, where you need to convert your `xy` dataset from a wide format to a long format prior to joining. The answers to [this question](https://stackoverflow.com/questions/2185252/reshaping-data-frame-from-wide-to-long-format) should help, especially [this answer](https://stackoverflow.com/a/25856135/2461552). – aosmith Jul 19 '18 at 18:29
  • Thanks aosmith, do you mean like transposing the `xy` data so that the `Sample_ID`'s are in a vertical column? I think because of ggplot2's grammar, there is more I would have to do after transposing the `xy` data. I think with ggplot, I have to select two columns. If I just convert the `xy` dataset from `Sample_ID`'s along the top to `Sample_ID`'s in the lefthand column, I would still have 80 columns that should be in my plot. It seems I have to perhaps put all of the X and Y information into two columns, and then join it to the `factor` data. I'm a little lost on what code might do that. – M. L. Jul 19 '18 at 21:27
  • (Also, thanks very much for the `check.names=FALSE` tip! That was bugging me.) – M. L. Jul 19 '18 at 21:29
  • Not a full transpose but, yes, a very long dataset with X in one column, Y in another, and the sample ID in a third. Those links in my last comment should be extremely helpful for providing code for this "wide to long" task. – aosmith Jul 19 '18 at 21:49
  • Thanks so much aosmith! I tried using melt() from the reshape2 package as that thread suggested, and it seems to have worked perfectly. Now my only problem is that it seems my dumpy computer can't handle plotting 3 million points, but that seems like more of a hardware issue than a code one. :] Thanks again for all of your help! – M. L. Jul 20 '18 at 14:44
  • Actually even my small dataset isn't plotting. So it's not just my hardware after all. But I did successfully reshape my data frame so that's a start! – M. L. Jul 20 '18 at 15:21

1 Answer


aosmith found this thread and this solution for getting my xy_data into the right format for plotting.

So here is the 99% complete solution. There seems to be an issue at the end where I'm not getting the plots I want, but that is a separate ggplot2 grammar issue, I'm sure.

I use the gather() method here from the tidyr package because the melt() function from the reshape2 package was turning my variables into factors instead of numerics. The gather() function turns them into characters, but at least it is easier to convert characters to numerics.
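For what it's worth, gather() has since been superseded in tidyr (1.0.0 and later) by pivot_longer(), which does the same wide-to-long reshape. Here is a minimal sketch of that alternative, assuming a recent tidyr; `xy_small` is a made-up miniature of the xy data, not the real dataset, and converting Sample_ID to integer afterwards is just one option:

```r
library(tidyr)

# Hypothetical miniature of the wide xy data: X plus one column per sample
xy_small <- data.frame(
  X = c(0.1, 0.2, 0.3),
  "1" = c(1, 2, 3),
  "2" = c(4, 5, 6),
  check.names = FALSE
)

# Every column except X becomes a (Sample_ID, Absorbance) pair
xy_long <- pivot_longer(
  xy_small,
  cols = -X,
  names_to = "Sample_ID",
  values_to = "Absorbance"
)

# Column names always arrive as character; convert if integers are wanted
xy_long$Sample_ID <- as.integer(xy_long$Sample_ID)
```

Note that the row order differs from gather(): pivot_longer() keeps the rows of each X value together rather than stacking one sample after another.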

### Simplified example datasets
factor_data <-data.frame(
  Sample_ID = factor(c(1:12)),
  Factor_A = sample(letters[3:6],12,replace=TRUE),
  Factor_B = sample(letters[7:8],12,replace=TRUE)
) # 12 obs x 3 variables

xy_data <-data.frame(
  X = c((1:80)/10),
  "1" = rnorm(80),
  "2" = rnorm(80),
  "3" = rnorm(80),
  "4" = rnorm(80),
  "5" = rnorm(80),
  "6" = rnorm(80),
  "7" = rnorm(80),
  "8" = rnorm(80),
  "9" = rnorm(80),
  "10" = rnorm(80),
  "11" = rnorm(80),
  "12" = rnorm(80),
  check.names = FALSE
) # 80 obs of 13 variables

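### Gather method
library(tidyr)
library(dplyr)   # for left_join() and mutate()
library(ggplot2)

# Reshape wide to long: one row per X value per sample
xy_data_gather <- gather(xy_data, key = "Sample_ID", value = "Absorbance", 2:ncol(xy_data)) # 960 obs by 3 variables

### Join
# gather() returns Sample_ID as character, so match that type in factor_data before joining
all_data <- left_join(xy_data_gather, mutate(factor_data, Sample_ID = as.character(Sample_ID)), by = "Sample_ID")

### Everything seems good at this point for plotting.

# The x column is named X in this simplified example, and Factor_A must be unquoted inside aes()
ggplot(data = all_data, aes(x = X, y = Absorbance)) + geom_point(aes(color = Factor_A))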

So at least now the datasets are framed and joined in a way that allows for plotting. If anyone can improve the answer so that the plot actually shows multiple smoothed lines all colored according to factors, that would be amazing.
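One possible way to get a line per spectrum, sketched on made-up miniature tables (`factor_small` and `xy_small` are hypothetical stand-ins, not the real data): map `group` to Sample_ID so that each sample is drawn as its own line, and map `color` to the unquoted factor column. This is only a sketch under those assumptions, and it draws the raw lines rather than smoothed ones.

```r
library(tidyr)
library(dplyr)
library(ggplot2)

# Hypothetical miniature versions of the two tables
factor_small <- data.frame(
  Sample_ID = c("1", "2", "3"),
  Factor_A = c("c", "d", "c")
)
xy_small <- data.frame(
  X = c(0.1, 0.2, 0.3, 0.4),
  "1" = c(1, 2, 3, 4),
  "2" = c(2, 3, 4, 5),
  "3" = c(4, 3, 2, 1),
  check.names = FALSE
)

# Wide to long, then attach the factor columns by Sample_ID
long_small <- gather(xy_small, key = "Sample_ID", value = "Absorbance", -X)
all_small <- left_join(long_small, factor_small, by = "Sample_ID")

# group keeps each sample as a separate line; color comes from the factor
p <- ggplot(all_small, aes(x = X, y = Absorbance, group = Sample_ID, color = Factor_A)) +
  geom_line()
```

Calling `print(p)` draws the plot. Without the `group` mapping, ggplot2 would connect all points that share a Factor_A level into a single zigzag line.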

Thanks aosmith for all the suggestions on all of this!

M. L.
  • Move `color` inside of `aes()` (like you had it in your original question). – aosmith Jul 20 '18 at 15:56
  • Oh you're right, and I also realized with `glimpse(all_data)` that my variable column is classed as `fct`, factors instead of integers. – M. L. Jul 20 '18 at 17:16
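
On the `fct` issue from the last comment: if the IDs end up as a factor and numbers are wanted, it is usually safer to go through character first, because calling as.integer() directly on a factor returns the internal level codes rather than the labels. A small sketch with a made-up vector (`sample_id` is hypothetical):

```r
# Hypothetical factor of sample IDs, as melt() might produce
sample_id <- factor(c("10", "2", "33"))

# as.integer() on a factor gives the level codes, not the labels
codes <- as.integer(sample_id)

# Converting to character first recovers the actual ID values
ids <- as.integer(as.character(sample_id))
```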