92

I am trying to visualize my data flow with a Sankey Diagram in R.

I found this blog post linking to an R script that produces a Sankey Diagram; unfortunately, it's quite raw and somewhat limited (see below for sample code and data).

Does anyone know of other scripts, or maybe even a package, that are more developed? My end goal is to visualize both data flow and percentages by relative size of diagram components, like in these examples of Sankey Diagrams.

I posted a somewhat similar question on the r-help list, but after two weeks without any responses I'm trying my luck here on Stack Overflow.

Thanks, Eric

PS. I'm aware of the Parallel Sets Plot, but that is not what I'm looking for.

# thanks to https://tonybreyal.wordpress.com/2011/11/24/source_https-sourcing-an-r-script-from-github/
sourc.https <- function(url, ...) {
  # install (if missing) and load the RCurl package
  if (!requireNamespace("RCurl", quietly = TRUE)) {
    install.packages("RCurl", dependencies = TRUE)
  }
  library(RCurl)

  # parse and evaluate each .R script in the global environment
  sapply(c(url, ...), function(u) {
    eval(parse(text = getURL(u, followlocation = TRUE,
                             cainfo = system.file("CurlSSL", "cacert.pem",
                                                  package = "RCurl"))),
         envir = .GlobalEnv)
  })
}

# from https://gist.github.com/1423501
sourc.https("https://raw.github.com/gist/1423501/55b3c6f11e4918cb6264492528b1ad01c429e581/Sankey.R")

# My example (there is another example inside Sankey.R):
inputs = c(6, 144)
losses = c(6,47,14,7, 7, 35, 34)
unit = "n ="

labels = c("Transfers",
           "Referrals\n",
           "Unable to Engage",
           "Consultation only",
           "Did not complete the intake",
           "Did not engage in Treatment",
           "Discontinued Mid-Treatment",
           "Completed Treatment",
           "Active in \nTreatment")

SankeyR(inputs,losses,unit,labels)

# Clean up my mess
rm("inputs", "labels", "losses", "SankeyR", "sourc.https", "unit")

Sankey Diagram produced with the code above

AndrewGB
Eric Fail
  • The arrows look fine to me, looks like you're left with fine tuning the text and you're in? – Roman Luštrik Apr 03 '12 at 08:09
  • @Roman Luštrik, I agree, this diagram isn't bad at all, but my R skills are still limited so I can't really do that much fine tuning in R, if that was what you meant? I could of course do it in Adobe Illustrator, or something like it, but that would break the principle of reproducible research, which for me is a central element in any (academic) work. Did you look at [the examples I linked to in the post](http://www.sankey-diagrams.com/tag/software/)? – Eric Fail Apr 03 '12 at 17:52
  • I realize my question is not a good question in the sense that it is not a specific programming problem and not directly practical, but a somewhat open-ended question ([from the FAQ](http://stackoverflow.com/faq)). To answer this question one would either need an overview of the different graphing options in R and on that basis answer my question with a _no, there are no scripts or packages out there that are more developed_, or one would need to know of a more developed method to produce Sankey Diagrams in R and point to it. Maybe there is a better place to post this question? – Eric Fail Apr 03 '12 at 18:01
  • The only place I can come up with is maybe crossvalidated.com. – Roman Luštrik Apr 03 '12 at 18:43
  • How about the R-help mailing list? http://www.r-project.org/mail.html – Alex Reynolds Apr 03 '12 at 22:05
  • @AlexReynolds, that was the first thing [I did](http://tolstoy.newcastle.edu.au/R/e17/help/12/03/7682.html), two weeks ago (please see fourth paragraph in my question). – Eric Fail Apr 03 '12 at 22:52
  • Does not any algorithmically produced data graphic count as *reproducible research*? You might have to use a different language to get the result you want. – RobinGower Apr 03 '12 at 23:27
  • @RobinGower, good point. The thing is that I am working in a lab that doesn't have that many technical resources, so starting to use things outside R to produce this plot wouldn't work. Unfortunately. R is normally quite superior when it comes to data visualization, so I was surprised to find that no one had made a package that could produce Sankey Diagrams. – Eric Fail Apr 04 '12 at 00:32

10 Answers

68

This plot can be created through the networkD3 package. It allows you to create interactive sankey diagrams. Here you can find an example. I also added a screenshot so you have an idea what it looks like.

# Load package
library(networkD3)

# Load energy projection data
URL <- paste0(
        "https://cdn.rawgit.com/christophergandrud/networkD3/",
        "master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)
# Plot
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
             Target = "target", Value = "value", NodeID = "name",
             units = "TWh", fontSize = 12, nodeWidth = 30)

enter image description here

Tung
Jonas Tundo
  • The example link is broken – Nelson Auner Aug 05 '16 at 17:51
  • Indeed. A better alternative since the introduction of `htmlwidgets` is the sankey plot from the `networkD3` package. I updated the post. – Jonas Tundo Aug 09 '16 at 21:08
  • Is it possible to have numeric values as caption instead of integer? The values are taken correctly, but the caption seems to be rounded off. Eg: value=0.8 and value=0.2 have different line widths, but the caption says '0' for both. – Naveen Mathew Sep 01 '16 at 09:22
  • If you try to reproduce this with a sample of your own data, make sure the first source id starts at 0 and that the source and target ids are consecutive – Richard Mar 11 '18 at 08:29
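
Picking up on the last comment, here is a minimal sketch of how your own flows could be put into the shape `sankeyNetwork()` expects, with node ids as zero-based row positions in the nodes table. The counts come from the question; the intermediate "Intake" node and the helper `make_sankey_data()` are assumptions made up for illustration, not part of networkD3.

```r
# Flows as a plain edge list; "Intake" is a hypothetical hub node.
flows <- data.frame(
  from  = c("Transfers", "Referrals", rep("Intake", 7)),
  to    = c("Intake", "Intake",
            "Unable to Engage", "Consultation only",
            "Did not complete the intake", "Did not engage in Treatment",
            "Discontinued Mid-Treatment", "Completed Treatment",
            "Active in Treatment"),
  value = c(6, 144, 6, 47, 14, 7, 7, 35, 34),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build a nodes table and zero-based link indices.
make_sankey_data <- function(flows) {
  node_names <- unique(c(flows$from, flows$to))
  nodes <- data.frame(name = node_names, stringsAsFactors = FALSE)
  links <- data.frame(
    source = match(flows$from, node_names) - 1L,  # 0-based position in nodes
    target = match(flows$to, node_names) - 1L,
    value  = flows$value
  )
  list(nodes = nodes, links = links)
}

sk <- make_sankey_data(flows)

# Only build the widget if networkD3 is installed.
if (requireNamespace("networkD3", quietly = TRUE)) {
  sankey_plot <- networkD3::sankeyNetwork(
    Links = sk$links, Nodes = sk$nodes,
    Source = "source", Target = "target", Value = "value",
    NodeID = "name", units = "n", fontSize = 12, nodeWidth = 30
  )
}
```

Evaluating `sankey_plot` at the console should render the diagram in the viewer or browser.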
47

I have created a package (riverplot) that has a slightly different, but overlapping functionality compared to the Sankey function, and can produce plots like this one:

enter image description here
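
For readers who want a starting point, here is a minimal sketch of a riverplot call as I understand the package's `makeRiver()` interface (nodes with `ID`, `x` and `labels` columns; edges with `N1`, `N2` and `Value`). The node layout and the use of the question's two input counts are assumptions for illustration; check `?makeRiver` against your installed version.

```r
# Nodes: ID, horizontal position x, and a display label (layout is illustrative).
nodes <- data.frame(
  ID     = c("A", "B", "C"),
  x      = c(1, 1, 2),
  labels = c("Transfers", "Referrals", "Intake"),
  stringsAsFactors = FALSE
)

# Edges: from node N1 to node N2, with the flow size in Value.
edges <- data.frame(
  N1    = c("A", "B"),
  N2    = c("C", "C"),
  Value = c(6, 144),
  stringsAsFactors = FALSE
)

# Only draw if riverplot is installed.
if (requireNamespace("riverplot", quietly = TRUE)) {
  river <- riverplot::makeRiver(nodes, edges)
  plot(river)
}
```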

January
40

If you want to do it with R, your best bet seems to be @Roman's suggestion: hack the SankeyR function. For example, below is my very quick fix: simply orient the labels vertically, slightly offset them, and decrease the font size for the input referrals to make it look a bit better. This modification only changes lines 171 and 223 in the SankeyR function:

    #line171 - change oversized font size of input label
    fontsize = max(0.5,frInputs[j]*1.5)#1.5 instead of 2.5 

    #line223 - srt changes from 35 to 90 to orient labels vertically, 
    #and offset adjusts them to get better alignment with arrows
    text(txtX, txtY, fullLabel, cex=fontsize, pos=4, srt=90, offset=0.1)

enter image description here

I am no ace in trigonometry, but that is really what you need for changing the direction of the arrows. Ideally, in my view, you would adjust the loss arrows so they are oriented horizontally rather than vertically. Otherwise, while my solution fixes the problem with label orientation, it doesn't make the diagram much more readable...

Geek On Acid
  • That's a nice hack, thanks. I already made it much better. You have my up-vote and if nothing better comes up I'm happy to transfer the bounty to you when the time runs out. Also, I like your user name. – Eric Fail Apr 05 '12 at 06:14
25

In addition to rCharts, Sankey diagrams can now also be generated in R with googleVis (version >= 0.5.0). For example, this post describes the generation of the following diagram using googleVis:

enter image description here
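
A minimal sketch of what a `gvisSankey()` call might look like, assuming an edge list with one row per flow. The counts are taken from the question; the "Intake" node and the lumped "Other outcomes" category are simplifications invented for this example.

```r
# One row per flow: origin, destination, size.
flows <- data.frame(
  From   = c("Transfers", "Referrals", "Intake", "Intake"),
  To     = c("Intake", "Intake", "Completed Treatment", "Other outcomes"),
  Weight = c(6, 144, 35, 115),
  stringsAsFactors = FALSE
)

# Only build the chart if googleVis is installed.
if (requireNamespace("googleVis", quietly = TRUE)) {
  sankey_chart <- googleVis::gvisSankey(
    flows, from = "From", to = "To", weight = "Weight"
  )
  # plot(sankey_chart) opens the chart in a browser
}
```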

leo9r
16

The alluvial package will also do this (example from ?alluvial).

# install.packages(c("alluvial"), dependencies = TRUE)
require(alluvial)

# Titanic data
tit <- as.data.frame(Titanic)

# 4d
alluvial( tit[,1:4], freq=tit$Freq, border=NA,
     hide = tit$Freq < quantile(tit$Freq, .50),
     col=ifelse( tit$Class == "3rd" & tit$Sex == "Male", "red", "gray") )

enter image description here

Eric Fail
geotheory
12

plotly can produce Sankey diagrams comparable to those of the networkD3 package (example link).
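
In case the linked example is unavailable, here is a minimal sketch using plotly's `sankey` trace type. As with networkD3, `source` and `target` are zero-based positions in the label vector. The counts come from the question; the "Intake" node and the lumped "Other outcomes" category are assumptions for illustration.

```r
# Node labels; links refer to these by zero-based position.
node_labels <- c("Transfers", "Referrals", "Intake",
                 "Completed Treatment", "Other outcomes")

link <- list(
  source = c(0, 1, 2, 2),
  target = c(2, 2, 3, 4),
  value  = c(6, 144, 35, 115)
)

# Only build the figure if plotly is installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(
    type = "sankey", orientation = "h",
    node = list(label = node_labels, pad = 15, thickness = 20),
    link = link
  )
}
```

Evaluating `fig` at the console should display the interactive diagram.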

enter image description here

cuttlefish44
7

For completeness, there is also the ggalluvial package which is a ggplot2 extension for alluvial/Sankey diagrams.

Here is an example taken from the package's documentation

# devtools::install_github("corybrunson/ggalluvial", ref = "optimization")
library(ggalluvial)

titanic_wide <- data.frame(Titanic)
ggplot(data = titanic_wide,
       aes(axis1 = Class, axis2 = Sex, axis3 = Age,
           y = Freq)) +
  scale_x_discrete(limits = c("Class", "Sex", "Age"), expand = c(.1, .05)) +
  xlab("Demographic") +
  geom_alluvium(aes(fill = Survived)) +
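  geom_stratum() +
  # newer ggalluvial releases expect after_stat(stratum); label.strata = TRUE was the older spelling
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +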
  theme_minimal() +
  ggtitle("passengers on the maiden voyage of the Titanic",
          "stratified by demographics and survival") +
  theme(legend.position = 'bottom')

ggplot(titanic_wide,
       aes(y = Freq,
           axis1 = Survived, axis2 = Sex, axis3 = Class)) +
  geom_alluvium(aes(fill = Class),
                width = 0, knot.pos = 0, reverse = FALSE) +
  guides(fill = "none") +
  geom_stratum(width = 1/8, reverse = FALSE) +
  geom_text(stat = "stratum", aes(label = after_stat(stratum)), reverse = FALSE) +
  scale_x_continuous(expand = c(0, 0), 
                     breaks = 1:3, labels = c("Survived", "Sex", "Class")) +
  scale_y_discrete(expand = c(0, 0)) +
  coord_flip() +
  ggtitle("Titanic survival by class and sex")

Created on 2018-11-13 by the reprex package (v0.2.1.9000)

Tung
6

Judging by these definitions this function, like the Parallel Sets Plot, lacks the capacity to split and combine flows (i.e. through more than one transition).

Since Sankey diagrams are directed weighted graphs, a package like qgraph might be useful.
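
As a rough sketch of that idea, the flows can be stored as a weighted adjacency matrix and handed to `qgraph()`. This draws a directed weighted graph rather than a true Sankey diagram. The counts come from the question; the "Intake" node and the lumped "Other outcomes" category are assumptions for illustration.

```r
node_names <- c("Transfers", "Referrals", "Intake",
                "Completed Treatment", "Other outcomes")

# Weighted adjacency matrix: rows are origins, columns are destinations.
W <- matrix(0, nrow = 5, ncol = 5, dimnames = list(node_names, node_names))
W["Transfers", "Intake"]           <- 6
W["Referrals", "Intake"]           <- 144
W["Intake", "Completed Treatment"] <- 35
W["Intake", "Other outcomes"]      <- 115

# Only draw if qgraph is installed; edge width scales with weight.
if (requireNamespace("qgraph", quietly = TRUE)) {
  qgraph::qgraph(W, directed = TRUE, labels = node_names, edge.labels = TRUE)
}
```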

The SankeyR function provides clearer labels if you sort the losses in descending order as the text is placed closer to the arrow heads without overlapping.
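
A small sketch of that sorting step, using the data from the question. The point to watch is that the loss labels have to be reordered together with the losses, while the two input labels stay in front. The variable names `ord`, `losses_sorted` and `labels_sorted` are my own.

```r
inputs <- c(6, 144)
losses <- c(6, 47, 14, 7, 7, 35, 34)
labels <- c("Transfers",
            "Referrals\n",
            "Unable to Engage",
            "Consultation only",
            "Did not complete the intake",
            "Did not engage in Treatment",
            "Discontinued Mid-Treatment",
            "Completed Treatment",
            "Active in \nTreatment")

# Order of the losses from largest to smallest.
ord <- order(losses, decreasing = TRUE)
losses_sorted <- losses[ord]

# The first length(inputs) labels belong to the inputs; the rest follow the losses.
n_in <- length(inputs)
labels_sorted <- c(labels[seq_len(n_in)], labels[n_in + ord])

# With Sankey.R sourced as in the question:
# SankeyR(inputs, losses_sorted, "n =", labels_sorted)
```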

RobinGower
  • Sorting _the losses in descending order_ would break the directional quality of the diagram. If you look closely at the diagram I submitted you will see that _time_ is on the x-axis, hence the current order. I'm aware of [sankey-diagrams.com](http://www.sankey-diagrams.com/) and the articles on it; my first thought when I saw that website was to open up R and produce a nice Sankey Diagram in [ggplot2](http://had.co.nz/ggplot2/). – Eric Fail Apr 04 '12 at 01:52
5

Have a look at //sankeybuilder.com, which offers a ready-to-go solution where you can upload your data and play back variations over time. The transition works well (similar to the YouTube demo in your question). If you load the SankeyTrend demo, it includes many time slots (years of data). Once loaded (it builds the Sankeys automatically), click the play button in the upper right-hand corner of the page to play back the time slots; you can even pause and resume. Demo URL is here: SankeyTrend. Hope this helps your quest for the perfect Sankey diagram.

Rob
1

Just open sourced a package that uses an alluvial diagram to visualize workflow stages. Since history is kept when the alluvial form is used, there aren't any crossovers in the edges.

https://github.com/claytontstanley/shiny.alluvial

enter image description here

Clayton Stanley