I have a dataset in long format. Each ID 'walks' 3 steps, and each step (variable name is step) can land on a different location (variable name is milestone). I want to draw all of the paths. Because some paths are more traveled than others, I want the width (size) of each path to be proportional to its count. I imagined something like geom_line(aes(size=..count..)) in ggplot, but it doesn't work.

Below is my code, which includes the URL for the example dataset. My silly solution for adding width was to dodge the lines, but the result is not proportional, and it leaves cracks.

library(ggplot2)
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv")
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
  geom_line(position = position_dodge(width=0.05)) +
  scale_x_discrete(limits=c("0","1","2","3","4","5","6","7","8","9")) +
  scale_y_discrete(limits=c("0","1","2","3","4","5","6","7","8","9"))

In the plot from my current code you can see the cracks, and the line widths are not proportional to the counts.

I was hoping this could look like a Sankey diagram, with the width indicating counts.

Chuck C

3 Answers


Does this help?

library(ggplot2)
ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv" ) 
ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
        stat_summary(geom="line", fun.y = "sum", aes(size=milestone),alpha=0.2, color="grey50")+
        scale_x_discrete(limits=factor(0:2)) +
        scale_y_discrete(limits=factor(0:10)) +
        theme(panel.background = element_blank(), 
              legend.position = "none")
knb
  • Thanks knb! I'd never thought about stat_summary. However, I don't think your code gives the correct solution: `fun.y="sum"` gives the sum of the values, not the count of the cases. Something like `fun.y="count"` would solve it, but there is no such option. – Chuck C Apr 14 '17 at 19:25
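
Following up on the comment above, one way to get counts rather than sums is to count the users on each step-to-step segment before plotting and map that count to the line size. The block below is only a minimal sketch under stated assumptions: the data frame `dd` is made-up toy data standing in for the question's CSV, and the names `segments`, `next_step`, and `next_milestone` are hypothetical helpers, not part of any answer here.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical values) with the same columns as the question's CSV
dd <- data.frame(
  user_id   = rep(1:4, each = 3),
  step      = rep(1:3, times = 4),
  milestone = c(1, 10, 2,   1, 10, 2,   1, 10, 3,   1, 2, 3)
)

# For every user, pair each step with the following one,
# then count how many users share each segment
segments <- dd %>%
  arrange(user_id, step) %>%
  group_by(user_id) %>%
  mutate(next_step      = lead(step),
         next_milestone = lead(milestone)) %>%
  ungroup() %>%
  filter(!is.na(next_step)) %>%
  count(step, milestone, next_step, next_milestone, name = "n")

# Draw one segment per distinct transition, with width mapped to the count.
# `size` matches the thread's usage; ggplot2 3.4+ prefers `linewidth` for lines.
p <- ggplot(segments) +
  geom_segment(aes(x = step, y = milestone,
                   xend = next_step, yend = next_milestone,
                   size = n),
               lineend = "round", alpha = 0.5)
p
```

Because each distinct segment is drawn once, there is nothing to dodge, so the cracks from overlapping lines should not appear. The widths follow ggplot2's size scale, so they are ordered by count rather than being an exact Sankey-style proportion.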

One option is to use the riverplot package. First you'll need to summarize your data so that you can define the edges and nodes.

> library(riverplot)
> library(dplyr)   # count(), group_by(), distinct(), %>%
> library(tidyr)   # spread()
> 
> paths <- spread(ddnew, step, milestone) %>%
+   count(`1`, `2`, `3`)
> paths
Source: local data frame [9 x 4]
Groups: 1, 2 [?]

    `1`   `2`   `3`     n
  <int> <int> <int> <int>
1     1     2     3     7
2     1     2    10     8
3     1     3     2     1
4     1     4     8     1
5     1    10     2   118
6     1    10     3    33
7     1    10     4     2
8     1    10     5     1
9     1    10    NA    46

Next define your nodes (i.e. each combination of step and milestone).

prefix <- function(p, n) {paste(p, n, sep = '-')}

nodes <- distinct(ddnew, step, milestone) %>%
  mutate(ID = prefix(step, milestone),
         y = dense_rank(milestone)) %>%
  select(ID, x = step, y)

Then define your edges:

e12 <- group_by(paths, N1 = `1`, N2 = `2`) %>%
  summarise(Value = sum(n)) %>%
  ungroup() %>%
  mutate(N1 = prefix(1, N1),
         N2 = prefix(2, N2))

e23 <- group_by(paths, N1 = `2`, N2 = `3`) %>%
  filter(!is.na(N2)) %>%
  summarise(Value = sum(n)) %>%
  ungroup() %>%
  mutate(N1 = prefix(2, N1),
         N2 = prefix(3, N2))

edges <- bind_rows(e12, e23) %>% 
  as.data.frame()

Finally, make the plot:

style <- default.style()
style$srt <- '0'  # display node labels horizontally

makeRiver(nodes, edges) %>% plot(default_style = style)


  • Joel, thank you for your answer. I encountered an error at `count(`1`, `2`, `3`)`. The error message says cannot found object '1'... any idea what went wrong? Thanks – Chuck C Apr 15 '17 at 16:48
  • not sure. does `spread(ddnew, step, milestone)` give three new columns called `1`, `2`, and `3`? – Joel Galang Apr 15 '17 at 17:51
  • I figured it out, it's tricky to have R recognize numbers as variable names... I changed the variable names and it worked out fine. Thanks again! – Chuck C Apr 15 '17 at 17:59
  • Joel, now that I have renamed 1, 2, and 3 to step1, step2 and step3, should I code `e12 <- group_by(paths, N1 = step1, N2 = step2) %>%` at the Define Edge section? I am not sure what the ' ` ' does. – Chuck C Apr 15 '17 at 18:09
  • That's correct. The back quotes are for column names that don't conform to R's naming conventions. – Joel Galang Apr 15 '17 at 22:33
  • Thanks Joel, I have just produced the graph following your advice! I have one last question though: is there a way to add a spectral color and/or alpha (transparency) to the edges to indicate the amount? Any suggestion is appreciated, thanks again! – Chuck C Apr 16 '17 at 05:30
  • You can add color to the edges by modifying the data frame for the edges. See the documentation for `makeRiver` – Joel Galang Apr 16 '17 at 22:20
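
The renaming workaround discussed in the comments above can be sketched as follows. This is only an illustration under stated assumptions: `dd` is made-up toy data, and the `step1`/`step2`/`step3` names come from the comment thread rather than from the answer's code. Prefixing the step values before `spread()` produces syntactic column names, so no back quotes are needed afterwards.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical values) with the same columns as the question's CSV
dd <- data.frame(
  user_id   = rep(1:3, each = 3),
  step      = rep(1:3, times = 3),
  milestone = c(1, 10, 2,   1, 10, 2,   1, 2, 3)
)

# Turn step values 1, 2, 3 into "step1", "step2", "step3" so that the
# spread columns are valid R names and can be referenced without back quotes
paths <- dd %>%
  mutate(step = paste0("step", step)) %>%
  spread(step, milestone) %>%
  count(step1, step2, step3)

paths
```

With columns named this way, the edge definitions would use `group_by(paths, N1 = step1, N2 = step2)` and `group_by(paths, N1 = step2, N2 = step3)`, as confirmed in the comments.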

If you are looking for user-specific counts of paths, then this might help:

library(dplyr)
library(ggplot2)

ddnew <- read.csv("https://raw.github.com/bossaround/question/master/data9.csv")

ddnew <- ddnew %>% 
  group_by(user_id) %>% 
  mutate(step_id = paste(step, collapse = ","), 
         milestone_id = paste(milestone, collapse = ",")) %>% 
  group_by(step_id, milestone_id) %>% 
  mutate(width = n())

ggplot(ddnew, aes(x=step, y=milestone, group=user_id)) +
  geom_line(aes(size = width)) +
  scale_x_discrete(limits=c("0","1","2","3","4","5","6","7","8","9")) +
  scale_y_discrete(limits=c("0","1","2","3","4","5","6","7","8","9"))

The idea is to count unique user-specific paths and assign these counts as width in the geom_line() aesthetic.

Felix