11

I am struggling to find a Numpy equivalent for a particular Matlab coding "pattern" using ismember.

Unfortunately this code tends to be where most of the time is spent in my Matlab scripts so I want to find an efficient Numpy equivalent.

The basic pattern consists of mapping a subset onto a larger grid. I have a set of key value pairs stored as parallel arrays and I want to insert these values into a larger list of key value pairs stored in the same way.

For concreteness say I have quarterly GDP data that I map onto a monthly time grid as follows.

quarters = [200712 200803 200806 200809 200812 200903];
gdp_q = [10.1 10.5 11.1 11.8 10.9 10.3];
months = 200801 : 200812;
gdp_m = NaN(size(months));
[tf, loc] = ismember(quarters, months);
gdp_m(loc(tf)) = gdp_q(tf);

Note that not all the quarters appear in the list of months so both the tf and the loc variables are required.

I have seen similar questions on StackOverflow but they either just give a pure Python solution (here) or where numpy is used then the loc argument isn't returned (here).

In my particular application area, this particular code pattern tends to arise over and over again and uses up most of the CPU time of my functions so an efficient solution here is really crucial for me.

Comments or redesign suggestions are also welcome.

Community
  • 1
  • 1
snth
  • 5,194
  • 4
  • 39
  • 48
  • Following if you would implement it yourself: 1. for objects take hash, you already have numbers - sort them and use binary search. 2. Another approach - use hashmap – Mikhail Poda Nov 26 '10 at 18:02
  • I think this [answer by Alex Martelli](http://stackoverflow.com/questions/1273041/how-can-i-implement-matlabs-ismember-command-in-python/1273815#1273815) is the best you can get. – Sven Marnach Nov 27 '10 at 11:11

3 Answers3

6

If months is sorted, use np.searchsorted. Otherwise, sort and then use np.searchsorted:

import numpy as np
quarters = np.array([200712, 200803, 200806, 200809, 200812, 200903])
months = np.arange(200801, 200813)
loc = np.searchsorted(months, quarters)

np.searchsorted returns the insertion position. If there is a possibility that your data is not even in the right range, you might want to have a check afterwards:

valid = (quarters <= months.max()) & (quarters >= months.min())
loc = loc[valid]

This is a O(N log N) solution. If this is still a big deal in your programme in terms of run time, you might just do this one subroutine in C(++) using a hashing scheme, which would be O(N) (as well as avoiding some constant factors, of course).

luispedro
  • 6,934
  • 4
  • 35
  • 45
  • Thanks. Could you give a quick overview of the idea behind the hashing scheme? In my case, most of the time both key arrays will probably be sorted so a linear O(N) scheme that simply walks through both in parallel should work. I'm always very hesitant to write any C code myself though - mostly because I haven't taken the time to really investigate how to link in my own extensions. – snth Nov 29 '10 at 15:19
  • I was thinking something along the lines of "hash = dict((val,i) for i,val in enumerate(months)); result = [hash[j] for j in quarters if j in months]" but coded in C. I'd use C++ in order to be able to use the STL hash type. – luispedro Nov 30 '10 at 00:45
2

I think you can redesign the original MATLAB code sample you give so that it doesn't use the ISMEMBER function. This may speed up the MATLAB code and make it easier to reimplement in Python if you still want to:

quarters = [200712 200803 200806 200809 200812 200903];
gdp_q = [10.1 10.5 11.1 11.8 10.9 10.3];

monthStart = 200801;              %# Starting month value
monthEnd = 200812;                %# Ending month value
nMonths = monthEnd-monthStart+1;  %# Number of months
gdp_m = NaN(1,nMonths);           %# Initialize gdp_m

quarters = quarters-monthStart+1;  %# Shift quarter values so they can be
                                   %#   used as indices into gdp_m
index = (quarters >= 1) & (quarters <= nMonths);  %# Logical index of quarters
                                                  %#   within month range
gdp_m(quarters(index)) = gdp_q(index);  %# Move values from gdp_q to gdp_m
gnovice
  • 125,304
  • 15
  • 256
  • 359
  • +1: ismember does all kinds of additional stuff, like calling `unique` that are not needed in your case, you you can definitely streamline the Matlab (or numpy) code. – Jonas Nov 26 '10 at 18:22
0

Try the ismember library from pypi.

pip install ismember

Example:

# Import library
from ismember import ismember

# data
quarters = np.array([200712, 200803, 200806, 200809, 200812, 200903])
months = np.arange(200801, 200812)

# Lookup
Iloc,idx=ismember(quarters, months)

# Iloc is boolean defining existence of quarters in months 
print(Iloc)
# [False,  True,  True,  True, False, False]

# index of months that exists in quarters
print(idx)
# [2, 5, 8]

print(months[idx])
[200803, 200806, 200809]

print(quarters[Iloc])
[200803, 200806, 200809]

# These vectors will match
quarters[Iloc]==months[idx]
erdogant
  • 1,544
  • 14
  • 23