I have around 20 variables, each of which comes from 4 different sources. For each variable, I want to visualize how the data varies across sources using ggplot.

I was thinking a line chart would be a good option. The x-axis could be the individual responses, with 4 lines (one per source) showing how the data changes across the sources. I could use region as a split variable to visualize by region.

My data looks something like the example below (only 2 variables are shown for simplicity):

library(data.table)

set.seed(1200)

ID <- seq(1001, 1100)
# Use TRUE rather than T: T is an ordinary variable that can be reassigned
region <- sample(1:10, 100, replace = TRUE)
Var1_source1 <- sample(1:100, 100, replace = TRUE)
Var1_source2 <- sample(1:100, 100, replace = TRUE)
Var1_source3 <- sample(1:100, 100, replace = TRUE)
Var1_source4 <- sample(1:100, 100, replace = TRUE)
Var2_source1 <- sample(1:100, 100, replace = TRUE)
Var2_source2 <- sample(1:100, 100, replace = TRUE)
Var2_source3 <- sample(1:100, 100, replace = TRUE)
Var2_source4 <- sample(1:100, 100, replace = TRUE)

df1 <- as.data.table(data.frame(ID,
                                region,
                                Var1_source1,
                                Var1_source2,
                                Var1_source3,
                                Var1_source4,
                                Var2_source1,
                                Var2_source2,
                                Var2_source3,
                                Var2_source4))

I feel this is an unusual requirement, as I do not have anything specific to plot on the x-axis.

Michael Harper
user1412

1 Answer

I am not entirely sure from your description what you want the plot to look like, but the first step in any ggplot is getting the data into a long format.

library(tidyverse)

df2 <- gather(df1, group, value, - c(ID, region)) %>%
  separate(group, c("Var", "Source")) 

head(df2)
    ID region  Var  Source value
1 1001      2 Var1 source1    92
2 1002      4 Var1 source1    44
3 1003      5 Var1 source1    15
4 1004      6 Var1 source1    42
5 1005      5 Var1 source1    39
6 1006      6 Var1 source1    48
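
As a side note, gather() is superseded in current tidyr, and the same reshape can be written with pivot_longer(). The following is a minimal sketch on a small made-up table (toy and toy_long are illustrative names, not part of the question's data), assuming tidyr 1.0.0 or later and column names of the form Var<k>_source<j>:

```r
library(tidyr)

# Toy stand-in for df1 with the same column naming scheme (assumption:
# every measurement column is named <Var>_<source>)
set.seed(1200)
n <- 10
toy <- data.frame(
  ID           = 1001:1010,
  region       = sample(1:3, n, replace = TRUE),
  Var1_source1 = sample(1:100, n, replace = TRUE),
  Var1_source2 = sample(1:100, n, replace = TRUE),
  Var2_source1 = sample(1:100, n, replace = TRUE),
  Var2_source2 = sample(1:100, n, replace = TRUE)
)

# Reshape to long format and split each column name at "_" in one step
toy_long <- pivot_longer(
  toy,
  cols      = -c(ID, region),
  names_to  = c("Var", "Source"),
  names_sep = "_",
  values_to = "value"
)

head(toy_long)
```

The rows come out in a different order than with gather() (row by row rather than column by column), but the columns ID, region, Var, Source and value are the same, so the plotting code does not need to change.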

We now have a Source column that we can map to an aesthetic in ggplot. I am not entirely sure what you want plotted, but here is an example:

# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
ggplot(df2, aes(x = region, y = value, colour = Source)) +
  stat_summary(fun = mean, geom = "line")

[Image: line chart of mean value by region, one coloured line per source]
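
To make it clearer what the lines trace: stat_summary() with mean computes the average value for each combination of region and Source. Roughly the same numbers can be computed by hand and plotted directly. This is only a sketch on made-up data (toy_long and region_means are illustrative names), assuming dplyr 1.0.0 or later for the .groups argument:

```r
library(dplyr)
library(ggplot2)

# Toy long-format data standing in for df2 (one variable only)
set.seed(1)
toy_long <- expand.grid(
  ID     = 1:30,
  Source = paste0("source", 1:4),
  stringsAsFactors = FALSE
)
# Each ID belongs to one region; repeat the assignment for every source
toy_long$region <- rep(sample(1:5, 30, replace = TRUE), times = 4)
toy_long$value  <- sample(1:100, nrow(toy_long), replace = TRUE)

# Mean value per region and source: the quantity the summary lines show
region_means <- toy_long %>%
  group_by(region, Source) %>%
  summarise(mean_value = mean(value), .groups = "drop")

# Plot the pre-computed means, one line per source
p <- ggplot(region_means, aes(x = region, y = mean_value, colour = Source)) +
  geom_line() +
  geom_point()
p
```

Having the means in a table also makes it easy to inspect how far apart the sources are in each region before deciding on a plot type.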

Or we can use a facet to split between the two variables:

# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
ggplot(df2, aes(x = region, y = value, colour = Source)) +
  stat_summary(fun = mean, geom = "line") +
  facet_grid(Var ~ .)

[Image: the same line chart, split into one panel per variable (Var1, Var2)]
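
On the facet formula: in facet_grid() the left-hand side of ~ gives the variable for the rows of panels and the right-hand side the variable for the columns, with a dot meaning "nothing on this side". The sketch below uses made-up data (toy_long, base and the p_* objects are illustrative names) and assumes ggplot2 3.3.0 or later. With around 20 variables, facet_wrap() may be the more practical choice, since it wraps the panels into a grid instead of stacking them in a single row or column.

```r
library(ggplot2)

# Toy long-format data standing in for df2
set.seed(1)
toy_long <- expand.grid(
  region = 1:5,
  Source = paste0("source", 1:4),
  Var    = paste0("Var", 1:2),
  rep    = 1:3,
  stringsAsFactors = FALSE
)
toy_long$value <- sample(1:100, nrow(toy_long), replace = TRUE)

# Shared base plot: mean value per region, one line per source
base <- ggplot(toy_long, aes(x = region, y = value, colour = Source)) +
  stat_summary(fun = mean, geom = "line")

p_rows <- base + facet_grid(Var ~ .)          # one row of panels per Var
p_cols <- base + facet_grid(. ~ Var)          # one column of panels per Var
p_vars <- base + facet_grid(rows = vars(Var)) # same layout as Var ~ .
p_wrap <- base + facet_wrap(~ Var)            # panels wrapped into a grid

p_rows
```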

Michael Harper
  • Thank you for your answer. Maybe my bad: I was picturing something like 4 lines that would tell me how the data across the 4 sources differs by region. Maybe this random data varies a lot across sources, which is why the plot is so spread out. Please suggest if there is any other way to visualize the difference across the 4 sources. – user1412 Apr 18 '18 at 11:07
  • Thank you for the edit. Yes, I was looking for something similar. Could you please explain what the "~." means in facet_grid? – user1412 Apr 18 '18 at 11:15
  • Glad it helps. Check out: http://ggplot2.tidyverse.org/reference/facet_grid.html https://stackoverflow.com/q/39148454/7347699 – Michael Harper Apr 18 '18 at 11:17
  • Thank you for pointing me to these materials !! Have a great day ahead !! – user1412 Apr 18 '18 at 11:22