Compare Cluster and Overall distributions of categorical variables using pie charts

Question

In the context of cluster profiling, I am trying to visualize the distribution of categorical variables in each cluster compared with the overall population.

In order to make them comparable, I use the Relative Frequency.

For numerical variables this is pretty straightforward, because I can easily overlay histograms.

For categorical variables, instead, I would like to obtain a chart in which the external pie chart shows the Relative Frequency of Cluster 1 and the internal pie chart shows the Relative Frequency of the Overall Population.

A reproducible example is:

mydf <- data.frame(week_day = as.factor(c(rep("monday",10), rep("monday",5), rep("tuesday",5))), cluster = c(rep(1,10), rep(2,10)))

Here, Cluster 1 is composed exclusively of "monday", whereas the Overall Population is 75% "monday" and 25% "tuesday".

The Relative Frequency within ggplot aes can be easily computed using:

y = (..count..)/sum(..count..)

Look into the package `sunburstR`. An example given by @StevenBeaupré in [this thread](https://stackoverflow.com/questions/33594642/beautiful-pie-charts-with-r/33594843) seems to include exactly what you want. — LAP, Jan 16 '18 at 10:11
thank you @LAP actually, I already found that post useful but it does not provide the solution for this problem — Seymour, Jan 16 '18 at 10:22

score 2 · Accepted Answer · answered Jan 16 '18 at 11:50

Let's assume you are looking at a variable with four categories (A, B, C, D), and you have this sort of data frame:

library(tidyverse) # provides tribble(), gather(), %>% and ggplot()

d <- tribble(~Category, ~Overall, ~Cluster1,
             "A", 250, 20,
             "B", 250, 110,
             "C", 250, 30,
             "D", 250, 40) %>%
  gather(Overall, Cluster1, key = "Cluster", value = "Count") # wide to long format

which would mean: "over the whole dataset, 250 points have category A, 250 have category B, etc., and in Cluster1, 20 points have category A, 110 have category B, etc."

ggplot assumes a pie chart is a (scaled) bar chart plotted with polar coordinates.

To get a bar chart with relative frequencies, specify a position = "fill" argument in geom_bar

ggplot(data = d) +
  geom_bar(stat = "identity",
           position = "fill", # automatically scales the bars from 0 to 1, necessary for polar coordinates
           aes(x = Cluster, y = Count, fill = Category))

which gives you the following chart: [Figure: bar chart with relative frequencies]
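The proportions that `position = "fill"` draws can be checked by hand. This is only a minimal sketch in base R: it rebuilds the same counts in long format and divides each count by the total of its ring. The column name `RelFreq` is an illustrative assumption, not part of the original answer.

```r
# Same counts as in the example above, written directly in long format
d <- data.frame(
  Category = rep(c("A", "B", "C", "D"), times = 2),
  Cluster  = rep(c("Overall", "Cluster1"), each = 4),
  Count    = c(250, 250, 250, 250, 20, 110, 30, 40)
)

# Relative frequency within each ring: count divided by the ring total
# (RelFreq is a hypothetical column name used only for this check)
d$RelFreq <- d$Count / ave(d$Count, d$Cluster, FUN = sum)
d
```

Each `Overall` row comes out at 0.25, while the `Cluster1` rows come out at 0.10, 0.55, 0.15 and 0.20, which are the bar heights the stacked chart displays.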

Then, you need to switch to polar coordinates and specify the y-axis as the angular parameter. The radial parameter will be your clusters/overall distribution.

You should pay attention to the order of factor levels, so that you get the right thing (here: the overall distribution) in the middle of the circles. My solution for the example is not meant to be optimal:

d$Cluster <- factor(d$Cluster, levels = c("Overall", "Cluster1"))
# `Overall` gets the lowest factor index, so it is drawn as the innermost ring
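To see why the reordering matters, here is a small sketch comparing the default level order with the explicit one. The vector `cluster` is a made-up stand-in for the `Cluster` column, not an object from the answer.

```r
# Hypothetical stand-in for the Cluster column
cluster <- c("Overall", "Cluster1", "Overall", "Cluster1")

# Default: levels are sorted alphabetically, so "Cluster1" comes first
default_levels <- levels(factor(cluster))

# Explicit levels put "Overall" first, i.e. closest to the centre
explicit_levels <- levels(factor(cluster, levels = c("Overall", "Cluster1")))

default_levels
explicit_levels
```

Without the explicit `levels` argument, `Cluster1` would end up in the middle and `Overall` on the outside.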

And then, add the coord_polar layer:

ggplot(data = d) +
  geom_bar(stat = "identity",
           position = "fill", # automatically scales the bars from 0 to 1, necessary for polar coordinates
           aes(x = Cluster, y = Count, fill = Category),
           width = .9) + # play with the bar width to change the blank space between the circles; 1 = no blank space
  coord_polar(theta = "y") + # the y coordinate becomes the angular parameter
  theme(axis.text.y = element_blank()) # I didn't look for a fancy way to display radial labels

Which gives you: [Figure: pie chart with relative frequencies]
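The same recipe could be applied to the `mydf` example from the question. The following is only a sketch under some assumptions: the helper objects `overall`, `cluster1` and `counts` and the column `group` are illustrative names, and the counts are built with base R `table()` rather than with any method from the question.

```r
library(ggplot2)

# Data from the question: cluster 1 contains only "monday"
mydf <- data.frame(
  week_day = as.factor(c(rep("monday", 10), rep("monday", 5), rep("tuesday", 5))),
  cluster  = c(rep(1, 10), rep(2, 10))
)

# Counts per week_day for the whole population and for cluster 1 only
# (subsetting a factor keeps all levels, so "tuesday" gets a count of 0)
overall  <- as.data.frame(table(week_day = mydf$week_day))
cluster1 <- as.data.frame(table(week_day = mydf$week_day[mydf$cluster == 1]))
overall$group  <- "Overall"
cluster1$group <- "Cluster 1"

# Stack both tables; "Overall" is the first level so it becomes the inner ring
counts <- rbind(overall, cluster1)
counts$group <- factor(counts$group, levels = c("Overall", "Cluster 1"))

p <- ggplot(counts) +
  geom_col(aes(x = group, y = Freq, fill = week_day),
           position = "fill", width = 0.9) + # rescale each ring to sum to 1
  coord_polar(theta = "y") +                 # y becomes the angle
  theme(axis.text.y = element_blank())
print(p)
```

The inner ring should then show the 75% / 25% split of the overall population, and the outer ring a full circle of "monday" for Cluster 1.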
