1

I am looking for a function that makes a new array of values based on ordered_ids, when the array has a length of one million.

Input:

    >>> ids=array(["WYOMING01","TEXAS01","TEXAS02",...])
    >>> values=array([12,20,30,...])
    >>> ordered_ids=array(["TEXAS01","TEXAS02","ALABAMA01",...])

Output:

    ordered [  20 , 30 , nan , ...]

Closing Summary

@Dietrich's use of a dictionary in list comprehension is 10x faster than using numpy index search (numpy.where). I compared the times of three results in my answer below.

b_dev
  • 2,568
  • 6
  • 34
  • 43
  • As a reaction to comment by Jamie, could you clarify if your ids are strings and do they have to be strings? And is your master_ids array equivalent to np.arange(n) up to a type or does it have missing values? – Martin Feb 24 '14 at 09:06
  • Why are you working with numpy arrays? Why not a simple list of strings? – hpaulj Mar 01 '14 at 21:54
  • In your example, master_order_ids appears to be a sorted list. If that's always the case, you wouldn't need it and your problem would be much simpler. – w-m Mar 02 '14 at 23:16
  • you might be interested in this question: http://stackoverflow.com/questions/3403973/fast-replacement-of-values-in-a-numpy-array – Noah Mar 03 '14 at 06:38

4 Answers4

1

You could try:

import numpy as np

def order_array(ids, values, master_order_ids):
    n = len(master_order_ids)
    idx = np.searchsorted(master_order_ids, ids)
    ordered_values = np.zeros(n)
    ordered_values[idx < n] = values[idx < n]
    print "ordered", ordered_values
    return ordered_values

Searchsorted gives you indices where you should insert ids into master_order_ids to keep the arrray ordered. Then you just drop those (idx, values) that are out of the range of master_order_ids.

Martin
  • 1,040
  • 9
  • 7
0

You could try using a dict() to associate the stings to your numbers. It simplifies the code considerably:

import numpy as np

def order_bydict(ids,values,master_order_ids):
    """ Using a dict to order ``master_order_ids`` """

    dd = dict([(k,v) for k,v in zip(ids, values)])  # create the dict
    ordered_values = [dd.get(m, 0) for m in master_order_ids]  # get() return 0 if key not found

    return np.asarray(ordered_values)  # return a numpy array instead of a list

The speedwise improvement is hard to predict without testing longer arrays (with your example it was 25% faster based on %timeit).

Dietrich
  • 5,241
  • 3
  • 24
  • 36
  • Thanks. My test (see in my answer below) with 10,000 elements in the arrays shows that your method is much better: not using dictionary: 1.32 ,using dictionary: 1.32 ,using dictionary with list comprehension: 0.013 – b_dev Mar 04 '14 at 02:37
0
import numpy
from numpy import copy, random, arange
import time

# SETUP    
N=10**4
ids = arange(0,N).astype(str)
values = arange(0,N)
numpy.random.shuffle(ids)
numpy.random.shuffle(values)
ordered_ids=arange(0,N).astype(str)


ordered_values = numpy.empty((N,1))
ordered_values[:] = numpy.NAN

# METHOD 1
start = time.clock()
for i in range(len(values)):ordered_values[ordered_ids==ids[i]]=values[i]
print "not using dictionary:", time.clock() - start

# METHOD 2
start = time.clock()
d = dict(zip(ids, values))
for k, v in d.iteritems(): ordered_values[ordered_ids==k] = v
print "using dictionary:", time.clock() - start

# METHOD 3 @Dietrich's approach in the answer above
start = time.clock()
dd = dict(zip(ids, values))
ordered_values = [dd.get(m, 0) for m in ordered_ids]
print "using dictionary with list comprehension:", time.clock() - start

Results

not using dictionary: 1.320237 # Method 1
using dictionary: 1.327119 # Method 2
using dictionary with list comprehension: 0.013287 # @Dietrich
b_dev
  • 2,568
  • 6
  • 34
  • 43
0

The following solution using the numpy_indexed package (disclaimer: I am its author) is purely vectorized, and likely to be much more efficient than the solutions posted thus far:

import numpy_indexed as npi
idx = npi.indices(ids, ordered_ids, missing='mask')
new_values = values[idx]
new_values[idx.mask] = -1   # or cast to float and set to nan, but you get the idea...
Eelco Hoogendoorn
  • 10,459
  • 1
  • 44
  • 42