I have the following synthetic dataframe, which includes numerical and categorical columns as well as a label column.
I want to plot a diagonal correlation matrix and display the correlation coefficients in the upper triangle, as in the following expected output:
I am aware that the categorical columns in the synthetic dataframe df need to be converted to numerical values. So far I have used this seaborn example with the 'titanic' dataset, which is synthetic and fits my task, and I added a label column as follows:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset with synthetic nature (categorical + numerical)
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)
# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))
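# Compute the correlation matrix over the numeric columns only
# (numeric_only exists from pandas 1.5; pandas >= 2.0 raises on text columns without it)
corr = df.corr(numeric_only=True)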
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1.0, vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
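Regarding the conversion of categorical columns mentioned above, a minimal sketch of one possible encoding is shown here. It uses a small hypothetical frame (the column values are made up for illustration, not taken from the real dataset) and replaces every non-numeric column with integer category codes. This is only one option: integer codes impose an arbitrary order on nominal categories, so one-hot encoding may suit some columns better, and missing values become -1.

```python
import pandas as pd

# Hypothetical stand-in for df: numeric, text, categorical and boolean columns
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "sex": ["male", "female", "female", "female", "male", "male"],
    "embarked": pd.Categorical(["S", "C", "S", "S", "S", "Q"]),
    "alone": [False, False, True, False, True, False],
})
df["label"] = [0, 1, 1, 1, 0, 0]

# Replace every non-numeric column with its integer category codes
encoded = df.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    encoded[col] = encoded[col].astype("category").cat.codes

# All columns are numeric now, so every one of them enters the matrix
corr = encoded.corr()
```

With this encoding the text, categorical and boolean columns all appear in corr instead of being dropped.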
I checked a related post but couldn't figure out how to do this task. The best I could find so far is this workaround, which uses the heatmapz package and gives me the following output:
#!pip install heatmapz
# Import the two methods from heatmap library
from heatmap import heatmap, corrplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a large random dataset
data = sns.load_dataset("titanic")
df = pd.DataFrame(data=data)
# Generate label column randomly '0' or '1'
df['label'] = np.random.randint(0,2, size=len(df))
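# Compute the correlation matrix over the numeric columns only
# (numeric_only exists from pandas 1.5; pandas >= 2.0 raises on text columns without it)
corr = df.corr(numeric_only=True)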
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))
mask[np.diag_indices_from(mask)] = False
np.fill_diagonal(mask, True)
# Set up the matplotlib figure
plt.figure(figsize=(8, 8))
# Draw the heatmap using "Heatmapz" package
corrplot(corr[mask], size_scale=300)
Sadly, corr[mask] doesn't mask the upper triangle in this package.
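For reference, one direction that might get close to the expected layout with seaborn alone is to draw two heatmaps on the same axes: one coloured and restricted to the lower triangle, and one with a plain white colormap that only annotates the upper triangle. The sketch below is an assumption about how this could be done, not a confirmed solution; it uses random stand-in data (hypothetical columns a to d plus label) rather than the titanic frame.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

# Random stand-in data (hypothetical columns), plus a random 0/1 label
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
df["label"] = rng.integers(0, 2, size=len(df))
corr = df.corr()

# Boolean selectors for the cells strictly above / below the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
lower = np.tril(np.ones(corr.shape, dtype=bool), k=-1)

fig, ax = plt.subplots(figsize=(8, 8))
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Pass 1: coloured cells in the lower triangle (everything else is masked)
sns.heatmap(corr, mask=~lower, cmap=cmap, vmin=-1.0, vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)

# Pass 2: white cells with the coefficients written in the upper triangle
sns.heatmap(corr, mask=~upper, cmap=ListedColormap(["white"]), annot=True,
            fmt=".2f", cbar=False, square=True, linewidths=.5, ax=ax)
```

Calling plt.show() afterwards displays the figure. The diagonal is masked in both passes, so it stays empty; changing k to 0 in either selector would include it.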
I also noticed that reaching this fancy plot is much easier in R, so I'm open to a more straightforward way of converting a Python pandas DataFrame to an R data frame. There seems to be a package called rpy2 that lets us use Python and R together, even in a Google Colab notebook: Ref.1
from rpy2.robjects import pandas2ri
pandas2ri.activate()
If that is the case, I found post1 and post2, which use R to visualize a correlation matrix.
So, in short, my first priority is to use Python and its packages (Matplotlib, seaborn, Plotly Express), and only then R and its packages, to reach the expected output.
Note
I have provided executable code in a Google Colab notebook with R using the dataset, so that you can form and test your final answer if your solution uses rpy2; otherwise I'd be interested in a Pythonic solution.