
data is a matrix containing 2500 time series of a measurement. I need to average each time series over time, discarding data points that were recorded around a spike (in the interval tspike - 10*dt ... tspike + 10*dt). The number of spike times varies for each neuron and is stored in a dictionary with 2500 entries. My current code iterates over neurons and spike times and sets the masked values to NaN; then bottleneck.nanmean() is called. However, this code is too slow in its current version, and I am wondering whether there is a faster solution. Thanks!

import bottleneck
import numpy as np
from numpy.random import rand, randint

t = 1
dt = 1e-4
N = 2500
dtbin = 10*dt

data = np.ones((N, int(round(t/dt))), dtype=np.float32)
times = np.arange(0,t,dt)
spiketimes = dict.fromkeys(np.arange(N))
for key in spiketimes:
  spiketimes[key] = rand(randint(100))

means = np.empty(N)

for i in range(N):        
  spike_times = spiketimes[i]
  datarow = data[i]
  if len(spike_times) > 0:
    for spike_time in spike_times:                        
      start=max(spike_time-dtbin,0)
      end=min(spike_time+dtbin,t)
      idx = np.all([times>=start,times<=end],0)
      datarow[idx] = np.NaN
  means[i] = bottleneck.nanmean(datarow)

2 Answers


Instead of using nanmean you could just index the values you need and use mean.

means[i] = datarow[(times < start) | (times > end)].mean()

If I misunderstood and you do need your indexing, you might try

means[i] = datarow[np.logical_not(np.all([times >= start, times <= end], 0))].mean()

Also, you probably don't need the if len(spike_times) > 0 check in your code: a for spike_time in spike_times loop over an empty array simply does nothing, so you can drop the guard and keep only the loop, as sketched below.
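
For example, a rough sketch of how that loop could look without the guard, reusing the names from the question's code (the keep mask is an addition for illustration, accumulated over all spikes of a row):

keep = np.ones(len(datarow), dtype=bool)     # start by keeping every sample
for spike_time in spike_times:               # an empty array simply skips the loop body
    start = max(spike_time - dtbin, 0)
    end = min(spike_time + dtbin, t)
    keep &= (times < start) | (times > end)  # drop samples inside this spike window
means[i] = datarow[keep].mean()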

jmetz
  • Taking the means should already be optimized: according to http://stackoverflow.com/questions/5480694/numpy-calculate-averages-with-nans-removed, bottleneck.nanmean() is the fastest way to take means over masked arrays. I was hoping that building the mask from a dictionary of spike times without iterating could bring an improvement in performance – maryam roayaee Aug 03 '12 at 20:32
  • @maryamroayaee: I don't think you need to have `NaN`'s or use a mask at all - you can just index into the values you want and take the `mean` - that should be faster than setting elements to NaN. – jmetz Aug 03 '12 at 20:36
  • @maryamroayaee: Also, I think there's a mistake in your code: when you set elements to NaN in each iteration, they are not restored to their pre-NaN values for the next iteration! – jmetz Aug 03 '12 at 20:40
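
A minimal sketch of that last point, assuming the original data matrix should stay untouched (the .copy() call is an illustration, not part of the question's code):

datarow = data[i].copy()   # work on a copy, so NaN-ing it leaves data itself unchanged
datarow[idx] = np.NaN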

The vast majority of the processing time in your code comes from this line:

idx = np.all([times>=start,times<=end],0)

This is because for each spike, you are comparing every value in times against start and end. Since you have uniform time steps in this example (and I presume this is true in your data as well), it is much faster to simply compute the start and end indexes:

# This replaces the last loop in your example:
for i in range(N):        
    spike_times = spiketimes[i]
    datarow = data[i]
    if len(spike_times) > 0:
        for spike_time in spike_times:
            start=max(spike_time-dtbin,0)
            end=min(spike_time+dtbin,t)
            #idx = np.all([times>=start,times<=end],0)
            #datarow[idx] = np.NaN
            datarow[int(start/dt):int(end/dt)] = np.NaN
    ## replaced this with equivalent for testing
    means[i] = datarow[~np.isnan(datarow)].mean()  

This reduces the run time for me from ~100s to ~1.5s. You can also shave off a bit more time by vectorizing the loop over spike_times. The effect of this will depend on the characteristics of your data (should be most effective for high spike rates):

kernel = np.ones(20, dtype=bool)   # 20 samples = the 2*dtbin window around each spike
for i in range(N):
    spike_times = spiketimes[i]
    datarow = data[i]
    mask = np.zeros(len(datarow), dtype=bool)
    indexes = (spike_times / dt).astype(int)
    mask[indexes] = True                       # mark the sample at each spike time
    mask = np.convolve(mask, kernel)[10:-9]    # widen each mark to the full window, trimmed back to the original length

    means[i] = datarow[~mask].mean()
Luke
  • Vectorizing the inner loop was what I was looking for. Also, thanks for the hint to use convolve for creating the intervals for the mask. In my case I got a speedup from several minutes to under a second. – maryam roayaee Aug 10 '12 at 09:26