2

In the question "Efficiently create a density plot for high-density regions, points for sparse regions" there is asked to replace the low density regions with NaNs. The relevant code in the accepted answer is the following:

hh, locx, locy = scipy.histogram2d(xdat, ydat, range=xyrange, bins=bins)
posx = np.digitize(xdat, locx)
posy = np.digitize(ydat, locy)

#select points within the histogram
ind = (posx > 0) & (posx <= bins[0]) & (posy > 0) & (posy <= bins[1])
hhsub = hh[posx[ind] - 1, posy[ind] - 1] # values of the histogram where the points are
xdat1 = xdat[ind][hhsub < thresh] # low density points
ydat1 = ydat[ind][hhsub < thresh]
hh[hh < thresh] = np.nan # fill the areas with low density by NaNs

I found that something like

hh = np.where(hh > thresh, hh, np.nan)

is also working. What are the differences, both in performing as results?

Community
  • 1
  • 1
Mathias711
  • 6,568
  • 4
  • 41
  • 58
  • Your original approach is known as advanced indexing and is a feature rather than a work-around! – jkalden Jan 29 '15 at 15:36
  • @jkalden it was not my approach, I just copied and pasted it, and it worked. Later I found out that np.where() also did the job, and in much less lines (and I understood it, most important) – Mathias711 Jan 29 '15 at 15:55
  • Take a look at `np.where(hh>thresh)`. Try selecting array elements with this tuple. It may help you understand what the longer form of `where` does. – hpaulj Jan 29 '15 at 19:09

1 Answers1

1

Advanced indexing (i.e. your original approach) is more efficient while the results are equal!

I compared both options with the following code:

import numpy as np
import time

t1 = time.time()
for i in xrange(1000):
        a = np.random.rand(10)
        a[a>0.5] = np.nan
t2 = time.time()
print 'adv. idx.: ',t2-t1

t1 = time.time()
for i in xrange(1000):
        a = np.random.rand(10)
        a = np.where(a>0.5,np.nan,a)
t2 = time.time()
print 'np.where: ',t2-t1

The result is rather obvious:

adv. idx.:  0.00600004196167
np.where:  0.0339999198914

np.where is significantly slower! The results where identical. However, this is not verifiable via == comparison as np.nan == np.nan yields False.

jkalden
  • 1,548
  • 4
  • 24
  • 26
  • But what is the whole hassle with the indices, and get the position of x/y data, et cetera? – Mathias711 Jan 29 '15 at 15:57
  • `thresh` is not defined in your code snippet, so I don't have an idea how the processing of `xdat, ydat` affects `thresh`. From my point of view, the question was about the difference between `hh[hh < thresh] = np.nan` and `hh = np.where(hh > thresh, hh, np.nan)`. Actually, `np.where(hh > thresh, hh, np.nan)` can only replace `hh[hh < thresh] = np.nan`, the rest of the job has to be done elsewhere, e.g. as shown in the lines 1-9 in your snippet! – jkalden Jan 29 '15 at 16:21