
The figure below was plotted using the openair R package:

a correlation matrix showing the relationships between variables

I know matplotlib has the plt.matshow function,
but it can't clearly show the relationships between all of the variables at the same time.

Here is my early work:

df is a pandas DataFrame with 7 variables, as shown below:

Image: screenshot of the df DataFrame

I don't know how to attach a .csv file to StackOverflow.

Using plt.matshow(df.corr(), cmap=plt.cm.Greens), the figure looks like this:

Image: plt.matshow plot of df.corr() with the Greens colormap

The second figure can't represent the correlation relations of the variables as clearly as the first one.

Edit:

I uploaded the CSV file to Google Docs here.

ali_m
Han Zhengzu
  • You should provide a basic dataset to work with. – Fabio Lamanna Jan 01 '16 at 13:00
  • Sorry, I'll provide it soon. – Han Zhengzu Jan 01 '16 at 13:09
  • Please don't post screenshots of your dataset - I can't copy/paste from an image. Paste the actual values into your question as text. – ali_m Jan 01 '16 at 15:33
  • What do you mean by representing the correlation relations? Do you mean the correlation coefficient values? If so, please take a look at seaborn's annotated heatmap https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.heatmap.html – ayhan Jan 01 '16 at 16:46
  • [Here's a related answer that uses the R `corrplot` package](http://stackoverflow.com/a/5453471/1461210) – ali_m Jan 02 '16 at 01:43

3 Answers


I'm not aware of any existing Python library that does these "ellipse plots", but it's not particularly hard to implement using a matplotlib.collections.EllipseCollection:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.collections import EllipseCollection

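def plot_corr_ellipses(data, ax=None, **kwargs):
    """Draw one ellipse per cell of a 2D correlation matrix.

    Extra keyword arguments are passed on to EllipseCollection.
    """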
    M = np.array(data)
    if not M.ndim == 2:
        raise ValueError('data must be a 2D array')
    if ax is None:
        fig, ax = plt.subplots(1, 1, subplot_kw={'aspect':'equal'})
        ax.set_xlim(-0.5, M.shape[1] - 0.5)
        ax.set_ylim(-0.5, M.shape[0] - 0.5)

    # xy locations of each ellipse center
    xy = np.indices(M.shape)[::-1].reshape(2, -1).T

    # set the relative sizes of the major/minor axes according to the strength of
    # the positive/negative correlation
    w = np.ones_like(M).ravel()
    h = 1 - np.abs(M).ravel()
    a = 45 * np.sign(M).ravel()

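    # newer Matplotlib releases name this keyword offset_transform;
    # older releases used transOffset instead
    ec = EllipseCollection(widths=w, heights=h, angles=a, units='x', offsets=xy,
                           offset_transform=ax.transData, array=M.ravel(), **kwargs)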
    ax.add_collection(ec)

    # if data is a DataFrame, use the row/column names as tick labels
    if isinstance(data, pd.DataFrame):
        ax.set_xticks(np.arange(M.shape[1]))
        ax.set_xticklabels(data.columns, rotation=90)
        ax.set_yticks(np.arange(M.shape[0]))
        ax.set_yticklabels(data.index)

    return ec
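
The line that computes the ellipse centers is fairly dense, so here is a small sketch of what it produces. The 2x3 matrix below is made up purely for illustration; the point is that each (x, y) pair is a (column, row) index, listed in the same row-major order as M.ravel(), so every ellipse is paired with the right coefficient:

```python
import numpy as np

# made-up 2x3 matrix, only used to show the index layout
M = np.array([[1.0, 0.2, -0.3],
              [0.2, 1.0, 0.5]])

# same expression as in plot_corr_ellipses
xy = np.indices(M.shape)[::-1].reshape(2, -1).T
values = M.ravel()

# each row of xy is (x, y) = (column, row); values follows the same order
for (x, y), v in zip(xy, values):
    assert M[y, x] == v
    print((int(x), int(y)), v)
```

Reversing the output of np.indices is what puts the column index first, which matches the x-then-y order that the offsets argument expects.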

For example, using your data:

data = df.corr()
fig, ax = plt.subplots(1, 1)
m = plot_corr_ellipses(data, ax=ax, cmap='Greens')
cb = fig.colorbar(m)
cb.set_label('Correlation coefficient')
ax.margins(0.1)

Image: ellipse plot of df.corr() with the Greens colormap and a colorbar

Negative correlations can be plotted as ellipses with the opposite orientation:

fig2, ax2 = plt.subplots(1, 1)
data2 = np.linspace(-1, 1, 9).reshape(3, 3)
m2 = plot_corr_ellipses(data2, ax=ax2, cmap='seismic', clim=[-1, 1])
cb2 = fig2.colorbar(m2)
ax2.margins(0.3)
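
The rule that turns a coefficient into an ellipse can be checked on its own. This is a minimal sketch in plain Python that mirrors the w, h and a arrays from plot_corr_ellipses; the helper name ellipse_params is hypothetical and not part of the function above:

```python
def ellipse_params(r):
    """Return (width, height, angle_deg) for a correlation coefficient r.

    Hypothetical helper mirroring the w, h and a arrays in plot_corr_ellipses.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    width = 1.0                  # the first axis always spans a full cell
    height = 1.0 - abs(r)        # stronger correlation gives a flatter ellipse
    sign = (r > 0) - (r < 0)     # -1, 0 or +1, like np.sign
    angle = 45.0 * sign          # tilt one way for r > 0, the other for r < 0
    return width, height, angle


for r in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(r, ellipse_params(r))
```

So r = 0 gives a circle, and |r| = 1 collapses the ellipse to a line along one of the diagonals.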

Image: 3x3 ellipse plot of values from -1 to 1 with the seismic colormap

ali_m

Assuming you are interested in showing cluster relations, the seaborn package mentioned in the comments also has a clustermap. Using your correlation matrix (it looks like you want to show the correlation coefficients as integers in the [-100, 100] range), you could do the following:

corr = df.corr().mul(100).astype(int)

     GX   HG   RM   SJ   XB   XN   ZG
GX  100   77   62   71   48   66   57
HG   77  100   69   74   61   61   58
RM   62   69  100   75   48   64   68
SJ   71   74   75  100   50   70   65
XB   48   61   48   50  100   46   51
XN   66   61   64   70   46  100   75
ZG   57   58   68   65   51   75  100

and then use seaborn.clustermap() as follows:

import seaborn as sns
sns.clustermap(data=corr, annot=True, fmt='d', cmap='Greens').savefig('cluster.png')
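
One detail worth knowing about the integer conversion: astype(int) truncates toward zero rather than rounding. Whether truncation was intended above is not stated, so the following is only a sketch of the difference, using a made-up 2x2 matrix; adding .round() first is an optional variation, not part of the original snippet:

```python
import pandas as pd

# made-up coefficients, chosen so truncation and rounding disagree
r = pd.DataFrame({"A": [1.0, 0.7699], "B": [0.7699, 1.0]}, index=["A", "B"])

truncated = r.mul(100).astype(int)          # as in the snippet above
rounded = r.mul(100).round().astype(int)    # optional variation: round first

print(truncated)
print(rounded)
```

Either frame can be passed to sns.clustermap with fmt='d', since both hold integers.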

Image: seaborn clustermap of corr with annotated integer values

Stefan

I just discovered this Python package biokit today. It provides a very handy function to create various kinds of correlation charts. For example:

In [1]: import pandas as pd

In [2]: import matplotlib.pyplot as plt
   ...: from biokit.viz import corrplot

In [6]: corr
Out[6]: 
      GX    HG    RM    SJ    XB    XN    ZG
GX  1.00 -0.77  0.62  0.71  0.48  0.66  0.57
HG -0.77  1.00  0.69  0.74  0.61  0.61  0.58
RM  0.62  0.69  1.00  0.75  0.48  0.64  0.68
SJ  0.71  0.74  0.75  1.00  0.50  0.70  0.65
XB  0.48  0.61  0.48  0.50  1.00 -0.46  0.51
XN  0.66  0.61  0.64  0.70 -0.46  1.00  0.75
ZG  0.57  0.58  0.68  0.65  0.51  0.75  1.00

I took Stefan's data and modified it a little bit. Let's assume this is a correlation matrix. Now to create a correlation chart, you can simply do this:

In [7]: c = corrplot.Corrplot(corr)
   ...: c.plot()

Correlation chart with ellipses

You can read more examples here.

Mengshan