I have a dataset with the number of reads for more than 3000 organisms, obtained at 3 different stages of an experiment. The data looks something like this:

          rain      day0      day7
 org1     923857    505062    503292
 org2     424002    198440    26314
 org3     2910      1492      535

...with 3000 more rows trailing this.

I want to plot the trend (number of reads) for each organism across the different stages (rain, day0, day7). Each organism should be represented with a different color, and all of them should be in the same plot.

I have tried doing this in Excel, but it has a limit of only 255 such lines in a single plot.

The plot I obtained in excel: Example Plot

Is there a way of doing this in R? I am new to R and therefore don't know much. I think ggplot might work, but I'm having a hard time understanding how to use it on this data.

Any help is greatly appreciated. Thanks.

Z.Lin
user_14
  • What is the name of the column containing the organism name? Have you written code to load your data into R? – Jack Brookes Jan 21 '19 at 01:10
  • The column was initially named 'names', but then I used that column as the row names. I can revert it back if that helps. And yes, I have the data loaded. – user_14 Jan 21 '19 at 01:29
  • I know this doesn't answer your actual question, but do you have the option to make a different kind of graph? I'm wondering how well a human could distinguish 3,000 different colors on the same graph. What about, say, a scatterplot of `reads_in_day0 - reads_in_rain` against `reads_in_day7 - reads_in_rain` instead? – A. S. K. Jan 21 '19 at 06:41
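The scatterplot suggested in the last comment could be sketched roughly as follows. This is only an illustration: the data frame `reads` is a hypothetical stand-in built from the three rows shown in the question, and the column names `d0_change` and `d7_change` are invented for this sketch.

```r
library(ggplot2)

# Hypothetical stand-in for the real data: the three rows shown in the question
reads <- data.frame(
  org  = c("org1", "org2", "org3"),
  rain = c(923857, 424002, 2910),
  day0 = c(505062, 198440, 1492),
  day7 = c(503292, 26314, 535)
)

# Change in reads relative to the rain stage (column names are made up here)
reads$d0_change <- reads$day0 - reads$rain
reads$d7_change <- reads$day7 - reads$rain

# One point per organism, so no per-organism color is needed
p <- ggplot(reads, aes(d0_change, d7_change)) +
  geom_point(alpha = 0.5)
print(p)
```

With thousands of organisms, each one becomes a single semi-transparent point, which may be easier to read than thousands of colored lines.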

1 Answer

Here is a version using library(tidyverse)

I created a data.frame based upon the data you provided,
gathered these variables to put the data into a long format,
changed the factor levels so that they align with the plot you provided,
and used ggplot to produce a figure.

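library(tidyverse)

# Example data based on the three rows shown in the question
data.frame(org = letters[1:3],
           rain = c(923857, 424002, 2910),
           day0 = c(505062, 198440, 1492),
           day7 = c(503292, 26314, 535)) %>% 
  # long format: one row per organism and stage
  gather(variable, value, -org) %>% 
  # order the stages as rain, day0, day7 on the x axis
  mutate(variable = factor(variable, levels = c('rain', 'day0', 'day7'))) %>% 
  ggplot(aes(variable, value, color = org, group = org)) + 
  geom_point() +
  geom_line() +
  theme(legend.position="bottom")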
B Williams
  • Hi, thanks for the answer. This works well when I have a much smaller dataset. But I extrapolated the same code to work on the original >3000-row dataset, and it's not showing any errors, but it's still running after 30 min. No plots yet. Do you think there is a more time-efficient way of doing this? Or should I consider more pre-processing of the data? – user_14 Jan 21 '19 at 03:31
  • 3000 rows isn't a problem - likely has to do with how you are "extrapolating" - how are you doing this? – B Williams Jan 21 '19 at 04:12
  • I have used the columns from my existing df (called f) for the corresponding data in org, rain, day0 and day7. So essentially, instead of using org = letters[1:3], rain = c(....), and so on, I have used org = f[,1], rain = f[,2], and so on. The rest I have kept the same. – user_14 Jan 21 '19 at 04:29
  • Pretty difficult to assist without the actual data in hand - see https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for some advice on making it easier for others to help you – B Williams Jan 21 '19 at 04:47
  • The actual data is the exact same data I shared in the question with 3000 more rows representing 3000 more organisms. I think since I'm using RStudio, it took a lot of time to process the data and plot. RStudio is slow when it comes to plotting. And also I was not able to view the actual plot since the legend took up all the space. So, I removed the legend and was able to view that the plot actually works, but I am still going to do a little more preprocessing of the data so it becomes more legible. Thanks a lot for your answer. This really helped. – user_14 Jan 21 '19 at 05:13
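
The fix described in the last comment (dropping the legend), combined with the row-name layout mentioned under the question, might look like the sketch below. The data frame `f` here is a hypothetical three-row stand-in with organism names stored as row names. The transparency and the log scale are additions of this sketch, not something used in the thread.

```r
library(tidyverse)

# Hypothetical stand-in for `f`: organism names stored as row names
f <- data.frame(
  rain = c(923857, 424002, 2910),
  day0 = c(505062, 198440, 1492),
  day7 = c(503292, 26314, 535),
  row.names = c("org1", "org2", "org3")
)

long <- f %>%
  rownames_to_column("org") %>%        # move row names back into a column
  gather(variable, value, -org) %>%    # long format: one row per organism and stage
  mutate(variable = factor(variable, levels = c("rain", "day0", "day7")))

p <- ggplot(long, aes(variable, value, color = org, group = org)) +
  geom_line(alpha = 0.4) +             # assumption: transparency to reduce overplotting
  scale_y_log10() +                    # assumption: log scale, since counts span orders of magnitude
  theme(legend.position = "none")      # a legend with thousands of entries would crowd out the plot
print(p)
```

Building the long data frame once and passing it to ggplot keeps the reshaping step separate from the plotting step, which makes it easier to see which of the two is slow on the full dataset.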