
As a beginner in data science, I want to cluster data points in order to visualize their distribution.

This is the current state: each point is a data point with an x and a y value.

[Image: current visualisation]

I want to get something like this: count all data points in each cell of a 2D grid and replace them with a single point whose size shows the number of data points in that grid cell.

I'm pretty sure there is a pandas/matplotlib function that will help me, but searching for clustering or grouping turned up nothing helpful.

[Image: my goal. The larger the point, the more data points are in that grid cell.]

Mr. T
Reinhard
  • I suggest providing the code so far as a [minimal, complete, and reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). You should also explain what the grouping conditions are. Pandas [groupby](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) will most likely be what you are looking for. – Mr. T Mar 19 '22 at 18:53
  • I think you might be looking for something like this: https://stackoverflow.com/questions/43422961/2-dimensional-binning-with-pandas – nickdmax Mar 19 '22 at 19:35

1 Answer

Here is my crack at it -- note that I am no matplotlib wiz or pandas ninja (I am more of an R/ggplot guy). There are probably easier ways to work with the data in Python/pandas.

import numpy as np
print('numpy: {}'.format(np.__version__))
import matplotlib as mpl
print('matplotlib: {}'.format(mpl.__version__))
import pandas as pd
print('pandas: {}'.format(pd.__version__))
# IPython/Jupyter magic; remove this line when running as a plain Python script
%matplotlib inline
import matplotlib.pyplot as plt

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

#  Define the names of the variables as we want them

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']
iris = pd.read_csv(url, names=names)

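# DataFrame.plot opens its own figure unless it is given an Axes, so create one with the desired size and pass it in
fig, ax = plt.subplots(figsize=(8, 4), dpi=288)
iris.plot(kind='scatter', x="petal-length", y="petal-width", ax=ax)
plt.show()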

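# Bin each variable into 5 quantile-based bins; the numeric labels are reused as plot coordinates below
d1 = iris.assign(
    petal_length_cut = pd.qcut(iris['petal-length'], 5, labels=np.linspace(1, 7, 5)),
    petal_width_cut = pd.qcut(iris['petal-width'], 5, labels=np.linspace(0, 2.5, 5))
)
# Combine the two bin labels into one (length, width) tuple per row
d2 = d1.assign(cartesian=pd.Categorical(d1.filter(regex='_cut').apply(tuple, axis=1)))
d3 = d2[['petal-length', 'petal-width', 'cartesian']]
print(d3)
# Count how many rows fall into each (length, width) cell
hist = d3['cartesian'].value_counts()
print(hist)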

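# Shift the bin labels by fixed offsets to get the plotting positions
x = [c[0] + .25 for c in hist.index]
y = [c[1] + .5 for c in hist.index]
# Marker size scales with the number of points in each cell
s = [hist[c] * 10 for c in hist.index]
plt.scatter(x, y, s=s)
plt.show()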

[Image: original plot of iris petal length vs. petal width]
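A side note on the binning: `pd.qcut` builds quantile-based bins (roughly equal numbers of points per bin, unequal widths), while `pd.cut` builds equal-width bins, which is closer to a regular grid. The following is only a small sketch with made-up values (the series `values` is not part of the iris data) to show the difference:

```python
import pandas as pd

# Made-up values with one outlier, purely for illustration
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quantile-based bins: each bin holds about the same number of points
by_quantile = pd.qcut(values, 2)

# Equal-width bins: each bin spans the same range of values
by_width = pd.cut(values, 2)

print(by_quantile.value_counts().sort_index())
print(by_width.value_counts().sort_index())
```

With the outlier present, the quantile bins split the points evenly, whereas the equal-width bins put almost everything into the first bin.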

I got somewhat better control of the binning and placement by using pd.cut (equal-width bins) and placing each point at the midpoint of its bin:

length_bins = pd.cut(iris['petal-length'],7)
width_bins = pd.cut(iris['petal-width'],5)
bins = pd.DataFrame({"l":length_bins, "w":width_bins})
hist = bins.value_counts()

hist.index = [(i[0].mid, i[1].mid) for i in hist.index]
#print(hist)

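# Each index entry is now a (length midpoint, width midpoint) tuple
x = [c[0] for c in hist.index]
y = [c[1] for c in hist.index]
# Marker size scales with the number of points in each cell
s = [hist[c] * 10 for c in hist.index]
plt.xlim([0, 7])
plt.ylim([0, 2.5])
plt.scatter(x, y, s=s)
plt.show()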
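As a possible alternative to the pandas route, the same count-per-grid-cell idea can probably be sketched with `numpy.histogram2d`, which returns the counts and the bin edges in one call. The helper names `grid_counts` and `plot_grid_counts`, the `scale` parameter, and the random demo data are all assumptions made for this illustration, not part of the code above:

```python
import numpy as np

def grid_counts(x, y, bins=(7, 5)):
    """Count points per 2D grid cell; return cell centres and counts of non-empty cells."""
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=bins)
    # Midpoint of every bin along each axis
    x_mid = (x_edges[:-1] + x_edges[1:]) / 2
    y_mid = (y_edges[:-1] + y_edges[1:]) / 2
    # counts[i, j] belongs to x-bin i and y-bin j, hence indexing="ij"
    xx, yy = np.meshgrid(x_mid, y_mid, indexing="ij")
    keep = counts > 0
    return xx[keep], yy[keep], counts[keep].astype(int)

def plot_grid_counts(x, y, bins=(7, 5), scale=10):
    """Scatter one marker per non-empty cell; marker size grows with the count."""
    import matplotlib.pyplot as plt  # imported here so grid_counts works without matplotlib
    cx, cy, n = grid_counts(x, y, bins)
    fig, ax = plt.subplots()
    ax.scatter(cx, cy, s=n * scale)
    return ax

# Made-up demo data; with the iris frame one would pass the two petal columns instead
rng = np.random.default_rng(0)
x_demo = rng.normal(size=200)
y_demo = rng.normal(size=200)
cx, cy, n = grid_counts(x_demo, y_demo)
```

Calling `plot_grid_counts(iris['petal-length'], iris['petal-width'])` followed by `plt.show()` should then give a plot similar to the one above, with the grid resolution controlled by `bins`.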


nickdmax