
I want to make a scatterplot of two categorical variables (both take only the values 0 and 1), but when I make a normal scatterplot in Python there are only four dots of the same size. All I have is a pandas data frame with two columns (A and B), each full of 0s and 1s.

https://pypi.org/project/bubble-plot/

I ran something like the example in the above link

bubble_plot(df, x = 'A', y = 'B')

And I think it gave me what I want, but I have no idea how to get a legend showing what the size or colors mean.

Any idea on how to get a bubble plot with a legend?

Thank you!

KVHelpMe
  • Can you show what your df looks like? See https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – StupidWolf Dec 16 '20 at 06:50

1 Answer


there are only four dots of the same size

Well, in your specific case both the x- and y-values contain only 0s and 1s, so bubble_plot() can only show bubbles at the four positions [0,0], [0,1], [1,0] and [1,1]. The bubble sizes show how often each combination of 'A' and 'B' occurs, i.e. the size of the bubble at [1,0] shows in how many rows there was a 1 in column 'A' and a 0 in column 'B'.
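The counts behind the four bubbles can also be checked directly. This is a minimal sketch using pandas' crosstab on a small hand-made frame (the data here is made up for illustration, not taken from the question):

```python
import pandas as pd

# Hypothetical toy data: six rows of 0s and 1s
df = pd.DataFrame({"A": [0, 0, 1, 1, 1, 0],
                   "B": [0, 1, 0, 0, 1, 1]})

# Rows are the values of A, columns are the values of B,
# and each cell is the number of rows with that combination
counts = pd.crosstab(df["A"], df["B"])
print(counts)

# Number of rows with A == 1 and B == 0, i.e. the bubble at [1,0]
print(counts.loc[1, 0])
```

Each cell of this table corresponds to one bubble position.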

If you add an import matplotlib.pyplot as plt and a call to plt.colorbar(), you'll be able to see that the colours mean the same as the sizes:

import pandas as pd
import numpy as np
from bubble_plot.bubble_plot import bubble_plot
import matplotlib.pyplot as plt

np.random.seed(2020)

A = np.random.choice([0,1],size=50)
B = np.random.choice([0,1],size=50)

df = pd.DataFrame({'A':A, 'B':B})

bubble_plot(df, x='A', y='B')

plt.colorbar()
plt.show()
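I'm not sure whether bubble_plot() itself can draw a legend for the sizes, so here is a rough sketch that sidesteps it and builds the same kind of plot with plain matplotlib. It assumes matplotlib 3.1 or newer (for legend_elements); the name scale and the value 40 are arbitrary choices of mine, and the random data will not match the numbers shown elsewhere in this answer:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(2020)
df = pd.DataFrame({"A": rng.integers(0, 2, 50),
                   "B": rng.integers(0, 2, 50)})

# One row per (A, B) combination with the number of occurrences in column "n"
counts = df.groupby(["A", "B"]).size().reset_index(name="n")

scale = 40  # marker area (points^2) per data row; purely cosmetic
fig, ax = plt.subplots()
sc = ax.scatter(counts["A"], counts["B"],
                s=counts["n"] * scale, c=counts["n"], cmap="viridis")

# Size legend: func converts marker areas back into row counts for the labels
handles, labels = sc.legend_elements(prop="sizes", func=lambda s: s / scale)
ax.legend(handles, labels, title="rows", loc="center")

# Colour legend carrying the same information
fig.colorbar(sc, ax=ax, label="rows")

ax.set_xticks([0, 1])
ax.set_yticks([0, 1])
ax.set_xlabel("A")
ax.set_ylabel("B")
ax.margins(0.3)  # leave room so the bubbles are not clipped

# Call plt.show() or fig.savefig("bubbles.png") to view the figure
```

The legend then lists one marker per distinct count, labelled with the number of rows it stands for.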

And if you were to use h = plt.hist2d(df['A'], df['B'], bins=2) instead of the bubble_plot(), you could use print(h[0]) to get the distribution information:

[[13. 15.]
 [14.  8.]]

or, normalised, with print(h[0]/h[0].sum()):

[[0.26 0.3 ]
 [0.28 0.16]]

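i.e. in 16% of the dataset a 1 in df['A'] coincides with a 1 in df['B'] (the rows of h[0] follow the x argument df['A'], the columns follow the y argument df['B']); the 0.28 entry is the share of rows with a 1 in df['A'] and a 0 in df['B'].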
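When reading such an array it helps to know which index belongs to which column. A small sketch with hand-made data (made up for illustration), using np.histogram2d, which returns the same kind of count array without drawing anything:

```python
import numpy as np

# Hypothetical toy data: seven rows of 0s and 1s
A = np.array([0, 0, 1, 1, 1, 0, 1])
B = np.array([0, 1, 0, 0, 1, 1, 0])

# First argument is binned along the first axis, second along the second axis
H, xedges, yedges = np.histogram2d(A, B, bins=2)

# H[i, j] is the number of rows with A == i and B == j
print(H)

# Share of rows per combination
share = H / H.sum()
print(share[1, 0])  # fraction of rows with A == 1 and B == 0
```

So the first index of the array is the value of 'A' and the second index is the value of 'B'.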

Asmus