Re-order numpy array based on where its associated ids are positioned in the `master_order` array

Question

I am looking for a function that makes a new array of values based on ordered_ids, when the array has a length of one million.

Input:

    >>> ids=array(["WYOMING01","TEXAS01","TEXAS02",...])
    >>> values=array([12,20,30,...])
    >>> ordered_ids=array(["TEXAS01","TEXAS02","ALABAMA01",...])

Output:

    ordered [  20 , 30 , nan , ...]

Closing Summary

@Dietrich's use of a dictionary in list comprehension is 10x faster than using numpy index search (numpy.where). I compared the times of three results in my answer below.

As a reaction to comment by Jamie, could you clarify if your ids are strings and do they have to be strings? And is your master_ids array equivalent to np.arange(n) up to a type or does it have missing values? — Martin, Feb 24 '14 at 09:06
Why are you working with numpy arrays? Why not a simple list of strings? — hpaulj, Mar 01 '14 at 21:54
In your example, master_order_ids appears to be a sorted list. If that's always the case, you wouldn't need it and your problem would be much simpler. — w-m, Mar 02 '14 at 23:16
you might be interested in this question: http://stackoverflow.com/questions/3403973/fast-replacement-of-values-in-a-numpy-array — Noah, Mar 03 '14 at 06:38

score 1 · Answer 1 · answered Feb 21 '14 at 09:41

1

You could try:

import numpy as np

def order_array(ids, values, master_order_ids):
    n = len(master_order_ids)
    idx = np.searchsorted(master_order_ids, ids)
    ordered_values = np.zeros(n)
    ordered_values[idx < n] = values[idx < n]
    print "ordered", ordered_values
    return ordered_values

Searchsorted gives you indices where you should insert ids into master_order_ids to keep the arrray ordered. Then you just drop those (idx, values) that are out of the range of master_order_ids.

answered Feb 21 '14 at 09:41

Martin

1,040
9
7

`["8", "9", "10"]` is not sorted for strings, so it would silently return bogus results. – Jaime Feb 21 '14 at 15:52
Right. I silently hoped that ids as strings is a mistake. My bad. – Martin Feb 24 '14 at 08:42

Dietrich · Accepted Answer · 2014-03-01T14:13:25.837

0

You could try using a dict() to associate the stings to your numbers. It simplifies the code considerably:

import numpy as np

def order_bydict(ids,values,master_order_ids):
    """ Using a dict to order ``master_order_ids`` """

    dd = dict([(k,v) for k,v in zip(ids, values)])  # create the dict
    ordered_values = [dd.get(m, 0) for m in master_order_ids]  # get() return 0 if key not found

    return np.asarray(ordered_values)  # return a numpy array instead of a list

The speedwise improvement is hard to predict without testing longer arrays (with your example it was 25% faster based on %timeit).

edited Mar 01 '14 at 14:13

answered Mar 01 '14 at 14:07

Dietrich

5,241
3
24
36

Thanks. My test (see in my answer below) with 10,000 elements in the arrays shows that your method is much better: not using dictionary: 1.32 ,using dictionary: 1.32 ,using dictionary with list comprehension: 0.013 – b_dev Mar 04 '14 at 02:37

b_dev · Answer 3 · 2014-03-04T20:22:17.300

import numpy
from numpy import copy, random, arange
import time

# SETUP    
N=10**4
ids = arange(0,N).astype(str)
values = arange(0,N)
numpy.random.shuffle(ids)
numpy.random.shuffle(values)
ordered_ids=arange(0,N).astype(str)


ordered_values = numpy.empty((N,1))
ordered_values[:] = numpy.NAN

# METHOD 1
start = time.clock()
for i in range(len(values)):ordered_values[ordered_ids==ids[i]]=values[i]
print "not using dictionary:", time.clock() - start

# METHOD 2
start = time.clock()
d = dict(zip(ids, values))
for k, v in d.iteritems(): ordered_values[ordered_ids==k] = v
print "using dictionary:", time.clock() - start

# METHOD 3 @Dietrich's approach in the answer above
start = time.clock()
dd = dict(zip(ids, values))
ordered_values = [dd.get(m, 0) for m in ordered_ids]
print "using dictionary with list comprehension:", time.clock() - start

Results

not using dictionary: 1.320237 # Method 1
using dictionary: 1.327119 # Method 2
using dictionary with list comprehension: 0.013287 # @Dietrich

score 0 · Answer 4 · answered Jul 26 '16 at 18:39

The following solution using the numpy_indexed package (disclaimer: I am its author) is purely vectorized, and likely to be much more efficient than the solutions posted thus far:

import numpy_indexed as npi
idx = npi.indices(ids, ordered_ids, missing='mask')
new_values = values[idx]
new_values[idx.mask] = -1   # or cast to float and set to nan, but you get the idea...

Re-order numpy array based on where its associated ids are positioned in the `master_order` array

4 Answers4

Results