1

I have a very large data set with two columns which relate as below.

df <- data.frame(
  group = c("123-4", "123-4", "234-5", "234-5", "345-6", "345-6"),
  age = c(38, 41, 65, 67, 78, 23))

group      age
123-4 38
123-4 41
234-5 65
234-5 67
345-6 78
345-6 23

I want to plot the two ages in each group against each other. I can do it by pulling out the min and max values of each group, but I want to keep the arbitrary order of my x and y values instead of having all the min values on x and all the max values on y. It seems this should be very easy, but I am beating my head against the proverbial wall.

MrFlick
Bruce
  • Would you find useful something like this ? https://stackoverflow.com/questions/41764818/ggplot2-boxplots-with-points-and-fill-separation – AntoniosK Nov 26 '18 at 17:09
  • A scatter plot would also be possible, but I'm not sure it's a good approach. It really depends on the nature of your grouping (`group` variable) and whether it makes sense to apply some kind of ordering. – AntoniosK Nov 26 '18 at 17:13
  • I'm unclear on what type of visual you want. Are you trying to show the distribution of ages within groups? Like a beeswarm or jittered scatter plot? – camille Nov 26 '18 at 17:16
  • I want to use a scatterplot. Most of these pairs will congregate about a pretty linear center, but I want to make the outliers stand out more by not plotting min() and max(). Ordering is irrelevant in this case; the "group" is just an assigned number and has no order. – Bruce Nov 26 '18 at 17:48
  • All the "groups" will only have two members. I want to visually pull out those groups that have a greater age difference than is typical. – Bruce Nov 26 '18 at 17:49

2 Answers

0

We can write a helper function to extract a value for each group.

group_val <- function(values, groups, index=1) tapply(values, groups, `[`, index)

For example

with(df, group_val(age, group, 1))
# 123-4 234-5 345-6 
#    38    65    78 
with(df, group_val(age, group, 2))
# 123-4 234-5 345-6 
#    41    67    23 

Then you could do

plot(group_val(df$age, df$group, 1), group_val(df$age, df$group, 2))
# or plot(group_val(age, group, 2) ~ group_val(age, group, 1), df)

Though the more usual way to handle this would be to reshape your data from long to wide. There are plenty of other questions on this site about that task. And if you want to use ggplot2, you'd have to do it that way. For example

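library(dplyr)   # group_by(), mutate(), n(), %>%
library(tidyr)   # spread()
library(ggplot2)
df %>% group_by(group) %>% 
  # label the rows within each group "a", "b", ... in their original order
  mutate(seq = letters[1:n()]) %>% 
  # long to wide: one column per label (pivot_wider() is the newer tidyr equivalent)
  spread(seq, age) %>% 
  ggplot(aes(a,b)) + geom_point()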
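If you would rather not load any packages, the same long-to-wide reshape can be sketched in base R. This is only one possible way to do it; the `seq` column and the `wide` object are names chosen here for illustration, and the axis labels are placeholders.

```r
df <- data.frame(
  group = c("123-4", "123-4", "234-5", "234-5", "345-6", "345-6"),
  age = c(38, 41, 65, 67, 78, 23))

# Number the rows within each group (1, 2, ...) in their original order
df$seq <- ave(seq_along(df$group), df$group, FUN = seq_along)

# Long to wide: one row per group, one age column per within-group position
wide <- reshape(df, idvar = "group", timevar = "seq", direction = "wide")

# reshape() names the new columns age.1 and age.2
plot(wide$age.1, wide$age.2, xlab = "first age", ylab = "second age")
```

Because the rows are numbered in the order they appear, the first age of each group goes on x and the second on y, with no min/max sorting.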
MrFlick
0

MrFlick nailed it with the right idea: long to wide. It was an easy fix, as I knew it should be, but I was too new to figure it out.

wide <- as.data.frame(t(unstack(df, age ~ group)))
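Since the goal is to make groups with an unusually large age difference stand out, one possible follow-up is to colour and label those points. This is a rough sketch only: the `cutoff` of 10 years is a made-up threshold for illustration, not something taken from the data, and the object names are hypothetical.

```r
df <- data.frame(
  group = c("123-4", "123-4", "234-5", "234-5", "345-6", "345-6"),
  age = c(38, 41, 65, 67, 78, 23))

# First and second age of each group, in original row order
first  <- tapply(df$age, df$group, `[`, 1)
second <- tapply(df$age, df$group, `[`, 2)

# Absolute age difference within each pair
gap <- abs(first - second)

cutoff <- 10               # hypothetical threshold, adjust to your data
is_outlier <- gap > cutoff

# Pairs with similar ages fall near the dashed y = x line
plot(first, second, pch = 19,
     col = ifelse(is_outlier, "red", "black"))
abline(0, 1, lty = 2)
text(first[is_outlier], second[is_outlier],
     labels = names(gap)[is_outlier], pos = 2)
```

With the sample data only group 345-6 exceeds the cutoff, so it is the only point drawn in red and labelled.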
Bruce