
I'm running into an issue trying to create a color map within a scatterplot. Here's the portion of my code:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    f, ax = plt.subplots()

    xy = np.vstack([x, y])
    xy = xy[~np.isnan(xy)]
    z = gaussian_kde(xy)(xy)

    idx = z.argsort()
    x, y, z = x[idx], y[idx], z[idx]

    plt.scatter(x, y, c=z, cmap='Reds', alpha=0.5)

x and y are both columns in my pandas DataFrame, and both contain NaN values. I believe gaussian_kde() was raising an error about infs or NaNs, so I tried to drop them with ~np.isnan(xy) and keep only the actual values. The NaN values do not line up between the two columns, and one column has more of them than the other, although both columns have the same number of elements. When I run my code, it just keeps running and I have to stop it. Any ideas what might be wrong?

researchnewbie
  • When posting your example, please show the libraries you are importing. We do not know where `gaussian_kde` comes from. What is the typical size of the x and y vectors? – Liris Nov 08 '19 at 08:56
  • @Liris I went ahead and updated to add the imported packages. x and y vectors are around 200k in size. – researchnewbie Nov 09 '19 at 02:40

1 Answer

You have to filter the NaNs out of both arrays with a single shared mask, so that x and y stay paired:

    inds = ~np.logical_or(np.isnan(x), np.isnan(y))
    x = x[inds]
    y = y[inds]
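
To illustrate why the shared mask matters, here is a minimal sketch on made-up toy arrays (the values are hypothetical, not the asker's data). Boolean-indexing the stacked 2-D array with a 2-D mask returns a flattened 1-D array, so the x/y pairing is lost. As far as I can tell, `gaussian_kde` then treats that 1-D array as a univariate dataset, which would also mean `z` no longer has the same length as `x` and `y`.

```python
import numpy as np

# Hypothetical toy data: the NaNs sit at different positions in x and y.
x = np.array([1.0, np.nan, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, np.nan, 8.0, 10.0])

xy = np.vstack([x, y])  # 2-D array, one row per variable

# Masking the stacked array with a 2-D boolean mask flattens it:
# the result is 1-D and x values are no longer paired with y values.
flat = xy[~np.isnan(xy)]

# Masking both arrays with one shared mask keeps the pairs aligned.
inds = ~np.logical_or(np.isnan(x), np.isnan(y))
xy_clean = np.vstack([x[inds], y[inds]])

print(xy.shape, flat.shape, xy_clean.shape)  # (2, 5) (8,) (2, 3)
```

With pandas columns, the same mask should work on the underlying arrays (for example after calling `.to_numpy()`), but that part is an assumption about how x and y were extracted.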

From this example, I think your code should look like:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import gaussian_kde

    # Generate fake data
    x = np.random.normal(size=1000)
    y = x * 3 + np.random.normal(size=1000)

    # Remove NaNs from both vectors at the same positions
    inds = ~np.logical_or(np.isnan(x), np.isnan(y))
    x = x[inds]
    y = y[inds]

    # Calculate the point density
    xy = np.vstack([x, y])
    z = gaussian_kde(xy)(xy)

    # Draw the points colored by density; 'none' disables the marker edges
    fig, ax = plt.subplots()
    ax.scatter(x, y, c=z, s=100, edgecolor='none')
    plt.show()

Just keep in mind that if x and y are very large vectors, gaussian_kde can take a long time to run: evaluating the density at each of the n points against all n data points scales roughly as O(n²). For a vector length of 50000, it takes about 40.5 sec to run.

Liris
  • I went ahead and ran it, and it worked! Interestingly, it took 300 seconds to run because of the large vector sizes I have. Is there any way to reduce this time without using gaussian_kde? I'm trying to create a colormap that would display the areas with the most overlapping points. – researchnewbie Nov 09 '19 at 02:59
  • I don't think so. If you have enough points, you can pick some of them at random to reduce the number of points you feed to `gaussian_kde`! – Liris Nov 09 '19 at 19:33
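
The random-subset idea from the last comment could be sketched roughly as follows. This is one possible reading of it, not something tested on the asker's data: the KDE is fitted on a random subset only and then evaluated at every point, so the cost grows with n × n_fit rather than n × n. The fake data, the seed, and the subset size `n_fit` are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Fake data standing in for the real, already NaN-free columns
n = 5000
x = rng.normal(size=n)
y = x * 3 + rng.normal(size=n)
xy = np.vstack([x, y])

# Fit the KDE on a random subset only (n_fit is a hypothetical choice)
n_fit = 500
subset = rng.choice(n, size=n_fit, replace=False)
kde = gaussian_kde(xy[:, subset])

# Evaluate the estimated density at every point
z = kde(xy)

# Sort so that the densest points come last and are drawn on top
idx = z.argsort()
x, y, z = x[idx], y[idx], z[idx]
```

The resulting `x`, `y`, and `z` can be passed to `scatter` with `c=z`, as in the answer. A smaller `n_fit` runs faster but gives a rougher density estimate, so the value probably needs some experimentation.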