I have the following synthetic dataframe with numerical and categorical columns as well as a label column. I want to plot a diagonal correlation matrix and display the correlation coefficients in the upper triangle, as follows:

expected output:

img

Setting aside the fact that the categorical columns in the dataframe df need to be converted to numerical ones, so far I have used this seaborn example with the 'titanic' dataset, which mixes categorical and numerical columns and fits my task, and added a label column as follows:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Load a sample dataset with mixed column types (categorical + numerical)
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)

# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))

# Compute the correlation matrix
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1.0, vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

img
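The categorical-to-numerical conversion mentioned above could be sketched roughly as follows. This is only one possible approach (one-hot encoding via `pd.get_dummies`), shown on a hypothetical toy frame named `toy` rather than the real data; the column names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data (an assumption,
# not the asker's dataset)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 70, size=50),
    "fare": rng.normal(30.0, 10.0, size=50),
    "sex": rng.choice(["male", "female"], size=50),
    "deck": rng.choice(["A", "B", "C"], size=50),
    "label": rng.integers(0, 2, size=50),
})

# One-hot encode the non-numeric columns; dtype=float keeps every
# resulting column numeric so that corr() can use all of them
encoded = pd.get_dummies(toy, columns=["sex", "deck"], dtype=float)

# Pearson correlation over the fully numeric frame
corr = encoded.corr()
print(corr.round(2))
```

One-hot columns of the same original feature are strongly anti-correlated with each other by construction, so passing `drop_first=True` to `pd.get_dummies` may give a less cluttered matrix.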

I checked a related post but couldn't figure out how to do this task. The best I have found so far is this workaround, which uses the heatmapz package and gives me the following output:

#!pip install heatmapz
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="white")

# Load the titanic sample dataset
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)

# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))

# Compute the correlation matrix
corr = df.corr(numeric_only=True)

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool)) 
mask[np.diag_indices_from(mask)] = False
np.fill_diagonal(mask, True)

# Set up the matplotlib figure
plt.figure(figsize=(8, 8))

# Draw the heatmap using "Heatmapz" package
corrplot(corr[mask], size_scale=300)

img

Sadly, corr[mask] doesn't mask the upper triangle in this package.

I also noticed that this kind of plot is much easier to produce in R, so I'm open to converting the pandas DataFrame to an R data frame if there is a straightforward way. There is a package called rpy2 that lets us use Python and R together, even in a Google Colab notebook: Ref.1

from rpy2.robjects import pandas2ri
pandas2ri.activate() 

If that is the case, I found post1 and post2, which use R to visualize a correlation matrix. In short, my first priority is Python and its packages (Matplotlib, seaborn, Plotly Express), and only then R and its packages, to reach the expected output.

Note

I have provided executable code in a Google Colab notebook with R using the dataset, so that you can form and test your final answer if your solution uses rpy2; otherwise I'd be interested in a Pythonic solution.

Trenton McKinney
Mario
  • Could you explain a little more about what you are looking for? In your sample data, I can't tell what you are trying to make a correlation matrix of. Do you need to pivot the Type column wider? – AndS. Sep 03 '22 at 14:47
  • I updated the post, which was a motivating example with a small `df`. I have some *categorical* or *numerical* features/columns as well as the `label` column (*Boolean*) within `df`. I want to show the possible **linear relationships** between the columns of `df` using a correlation matrix in a fancy way, as shown in the expected output, displaying the coefficients **only** in the upper triangle. In the lower triangle, I want to use squares of different sizes. The pivot table helps with statistical reports such as a bar chart over Type or Length per class using `label`. – Mario Sep 03 '22 at 15:43

4 Answers

4

I'm not an expert in rpy2, so I can't help there, but here is how I would build it out in R. Since I don't have your data, I can't promise that everything will work perfectly for your dataset, but here is a general outline:

library(tidyverse)

#get some data
df <- as_tibble(mtcars) |>
  (\(d) select(d, order(colnames(d))))()
  
#calculate correlation matrix
cor_mat <- cor(df) 

#make 2 "blank" matrices
low <- matrix(NA, nrow = nrow(cor_mat), ncol = ncol(cor_mat))
up <- matrix(NA, nrow = nrow(cor_mat), ncol = ncol(cor_mat))

#populate upper and lower matrices
up[upper.tri(up)] <- cor_mat[upper.tri(cor_mat)]
low[lower.tri(low)] <- cor_mat[lower.tri(cor_mat)]


#pivot upper and lower for plotting
lower_dat <- low|>
  as.data.frame() |>
  `colnames<-`(colnames(df)) |>
  mutate(xvar = colnames(df)) |>
  pivot_longer(cols = -xvar, names_to = "yvar") 

upper_dat <- up|>
  as.data.frame() |>
  `colnames<-`(colnames(df)) |>
  mutate(xvar = colnames(df)) |>
  pivot_longer(cols = -xvar, names_to = "yvar") 


#plot
lower_dat|> #lower matrix data
  ggplot(aes((xvar), yvar))+ 
  geom_tile(fill = NA, color = "grey")+ #background grid
  geom_point(aes(fill = value, size = value), pch = 22)+ # different sized points
  geom_text(data = upper_dat, aes(color = value, label = round(value, 2)))+ #plot cor in upper right
  scale_size_continuous(breaks = seq(-1, 1, by = 0.5))+ # define size breaks
  labs(x = "", y = "")+ #remove unnecessary labels
  scale_fill_gradient2(low = "darkred",mid = "white", high = "darkblue", midpoint = 0)+ #define square colors
  scale_color_gradient2(low = "darkred",mid = "white", high = "darkblue", midpoint = 0)+ #define text colors
  scale_x_discrete(limits = rev)+ # rev to put the triangle on a certain side
   #make it look pretty
  theme(panel.background = element_blank(), 
        panel.border = element_rect(fill = NA, color = "black"),
        axis.text = element_text(color = "black", size = 10),
        axis.title = element_text(size = 12))

AndS.
  • Thanks for your input. It would be great if you could provide executable code in a Google [Colab notebook with R](https://colab.research.google.com/drive/1SDnbG3Ln2g-ti4tuWbIyLZZczLpu7CCK?usp=sharing) using the [dataset](https://drive.google.com/file/d/1_e1vhAt7J4I-mpg86u2-nVlomrubY0d_/view?usp=sharing) so that you can form your final answer. So it's not possible to produce this plot in Python? – Mario Sep 04 '22 at 17:11
  • I do most of my plotting in R, but I'm sure this could be done in python as well. Again, I don't use R in google Colab, so I can't help you further than this. Hopefully you can use this as a jumping off point. – AndS. Sep 05 '22 at 10:39
2

Another option is creating two corrplots from the corrplot package in R. You can specify one plot with add=TRUE to combine both plots. Here is a reproducible example with mtcars dataset:

library(corrplot)
M<-cor(mtcars)
diag(M) <- 0
corrplot(M, method="number", type = "upper", tl.pos = "t")
corrplot(M, method="square", type = "lower", tl.pos = "l", cl.pos = "n", add = TRUE)

Output:

img

Quinten
  • Thanks for your input. I tried your solution [here](https://colab.research.google.com/drive/1SDnbG3Ln2g-ti4tuWbIyLZZczLpu7CCK?usp=sharing#scrollTo=9q_wSxtiq26p) and it was pretty straightforward, but as I mentioned in the bounty description on the post: *I need a Pythonic solution that can be executable easily, at least on Google Colab Notebook.* I'm much more interested in a **Pythonic** solution. Maybe you can adapt your solution using `rpy2` and add extra value by bridging the gap between Python and R. Please feel free to use the shared notebook. – Mario Sep 12 '22 at 17:16
1

I'd be interested in a Pythonic solution.

Use a seaborn scatter plot with matplotlib text/line annotations:

  1. Plot the lower triangle via sns.scatterplot with square markers
  2. Annotate the upper triangle via plt.text
  3. Draw the heatmap grid via plt.vlines and plt.hlines

Full code using the titanic sample:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="white")

# generate sample correlation matrix
df = sns.load_dataset("titanic")
df["label"] = np.random.randint(0, 2, size=len(df))
corr = df.corr(numeric_only=True)

# mask and melt correlation matrix
mask = np.tril(np.ones_like(corr, dtype=bool)) | corr.abs().le(0.1)
melt = corr.mask(mask).melt(ignore_index=False).reset_index()
melt["size"] = melt["value"].abs()

fig, ax = plt.subplots(figsize=(8, 6))

# normalize colorbar
cmap = plt.cm.RdBu
norm = plt.Normalize(-1, 1)
sm = plt.cm.ScalarMappable(norm=norm, cmap=cmap)
cbar = plt.colorbar(sm, ax=ax)
cbar.ax.tick_params(labelsize="x-small")

# plot lower triangle (scatter plot with normalized hue and square markers)
sns.scatterplot(ax=ax, data=melt, x="index", y="variable", size="size",
                hue="value", hue_norm=norm, palette=cmap,
                style=0, markers=["s"], legend=False)

# format grid
xmin, xmax = (-0.5, corr.shape[0] - 0.5)
ymin, ymax = (-0.5, corr.shape[1] - 0.5)
ax.vlines(np.arange(xmin, xmax + 1), ymin, ymax, lw=1, color="silver")
ax.hlines(np.arange(ymin, ymax + 1), xmin, xmax, lw=1, color="silver")
ax.set(aspect=1, xlim=(xmin, xmax), ylim=(ymax, ymin), xlabel="", ylabel="")
ax.tick_params(labelbottom=False, labeltop=True)
plt.xticks(rotation=90)

# annotate upper triangle
for y in range(corr.shape[0]):
    for x in range(corr.shape[1]):
        value = corr.mask(mask).to_numpy()[y, x]
        if pd.notna(value):
            plt.text(x, y, f"{value:.2f}", size="x-small",
                     # color=sm.to_rgba(value), weight="bold",
                     ha="center", va="center")

Note that since most of these titanic correlations are low, I disabled the text coloring for readability.

If you want color-coded text, uncomment the color=sm.to_rgba(value) line at the end.
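If the squares should scale with both the coefficient and the figure size, one option is to compute the marker areas explicitly and pass them to matplotlib's `ax.scatter`, whose `s` argument is an area in points squared. The following is a minimal sketch under stated assumptions: the 4x4 matrix is hypothetical, `square_areas` is a helper name introduced here for illustration, and the cell size is only approximated from the figure width.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np


def square_areas(values, max_area):
    """Return marker areas (points^2) proportional to |correlation|."""
    return np.abs(np.asarray(values, dtype=float)) * max_area


# Hypothetical 4x4 correlation matrix (an assumption, not the titanic result)
labels = ["a", "b", "c", "d"]
corr = np.array([[1.0, 0.8, -0.3, 0.1],
                 [0.8, 1.0, 0.5, -0.6],
                 [-0.3, 0.5, 1.0, 0.2],
                 [0.1, -0.6, 0.2, 1.0]])
n = len(labels)

# Row/column positions of the strictly lower triangle
rows, cols = np.tril_indices_from(corr, k=-1)
values = corr[rows, cols]

fig, ax = plt.subplots(figsize=(6, 6))

# Approximate side length of one grid cell in points, so that the
# largest square grows when figsize grows
cell_pts = 72 * fig.get_size_inches()[0] * ax.get_position().width / n
areas = square_areas(values, max_area=(0.9 * cell_pts) ** 2)

# Square markers: area encodes |r|, color encodes the signed value
ax.scatter(cols, rows, s=areas, c=values, marker="s",
           cmap="RdBu", vmin=-1, vmax=1)
ax.set_xticks(range(n))
ax.set_xticklabels(labels)
ax.set_yticks(range(n))
ax.set_yticklabels(labels)
ax.set(xlim=(-0.5, n - 0.5), ylim=(n - 0.5, -0.5), aspect=1)
fig.savefig("corr_squares.png")
```

Here the area, not the side length, is proportional to the absolute coefficient; taking the square of `np.abs(values)` before scaling would make the side length proportional instead.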

halfer
tdy
  • Thanks for your input using `sns.scatterplot()`, as had been addressed in this [workaround](https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec) in the post. There are a few problems with your solution that I couldn't figure out, as you can see in the shared Google Colab [notebook](https://colab.research.google.com/drive/1W3Ts8Sl3B_d3ZTILQZ9ZUXwGCu5CoqGg#scrollTo=YRHYUiAgrMqS): – Mario Sep 09 '22 at 08:18
  • 1- How can I adjust `figsize=(10, 8)`? 2- How can I set a threshold within the mask argument, e.g. `ax = sns.heatmap(corr, mask=mask | (np.abs(corr) <= 0.1)`? 3- I had to comment out `cbar` since it plots twice with different sizes. – Mario Sep 09 '22 at 08:18
  • @Mario The new code should address these issues: 1- I've updated it with `fig, ax = plt.subplots(figsize=(10, 8))` so the colorbar and scatter plot now use `ax=ax` 2- I've updated the original mask definition to `mask = np.tril(np.ones_like(corr, dtype=bool)) | corr.abs().le(0.1)` 3- Sorry, the duplicate colorbar was a typo and has now been removed – tdy Sep 09 '22 at 08:35
  • 4- I noticed that when I increase the figure size with `figsize=(10, 8)`, the coefficient text in the upper triangle stays tiny, and the size of the squares in the lower triangle doesn't represent the coefficients, which is the main idea of the **Expected output** in the post. In other words, the square sizes are not proportional to the coefficients when the figure size increases. How can we fix this? See the last cell in the [notebook](https://colab.research.google.com/drive/1W3Ts8Sl3B_d3ZTILQZ9ZUXwGCu5CoqGg#scrollTo=akWhhZqZcR5Y). 5- The label size also remains small (that one I can figure out). – Mario Sep 09 '22 at 09:20
0

I cannot set up the heatmap package on Windows, but have you tried setting the upper-diagonal elements to NaN?

corr_masked = corr.copy()
corr_masked[mask] = np.nan

corrplot(corr_masked, size_scale=300)

plt.plot, for example, does not plot NaN samples, so the same trick may work here. If not, just setting the upper-diagonal elements to 0 may suffice (or whatever value corresponds to white on the color scale).
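The NaN-masking step itself can be checked without the plotting package. The sketch below uses pandas' `DataFrame.mask` on a small hypothetical correlation matrix; whether `corrplot` then skips the NaN cells is an assumption that would still need to be verified against heatmapz.

```python
import numpy as np
import pandas as pd

# Small hypothetical correlation matrix (an assumption for illustration)
rng = np.random.default_rng(1)
frame = pd.DataFrame(rng.normal(size=(30, 4)), columns=list("abcd"))
corr = frame.corr()

# True strictly above the diagonal; k=1 leaves the diagonal unmasked
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# DataFrame.mask replaces entries where the condition is True with NaN
corr_lower = corr.mask(upper)
print(corr_lower.round(2))
```

Unlike `corr[mask]`, this keeps the shape and labels of the matrix intact and only blanks out the masked cells.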

kesh
  • I tried to adapt your solution using the `heatmapz` package. I also provided a [Google colab](https://colab.research.google.com/drive/1W3Ts8Sl3B_d3ZTILQZ9ZUXwGCu5CoqGg?usp=sharing) notebook for quick troubleshooting. The problem is that I couldn't manage to display the coefficients in the upper triangle while the lower triangle is drawn as squares. I tried to use multiple masks (mask1 and mask2) without success. – Mario Sep 08 '22 at 20:08