Can anyone explain how I can count pairs from two arrays without any explicit iteration (e.g. using numpy)?

Example: I have two numpy arrays, origin and destination. An origin and its destination can have the same value. Let's say I have 6 items in each array:

origin = np.array(['LA', 'SF', 'NY', 'NY', 'LA', 'LA'])

dest = np.array(['SF', 'NY', 'NY', 'SF', 'LA', 'LA'])

The first item is from LA-SF, second SF-NY, third NY-NY, and so on.

The result that I want is

array([[1, 0, 1],
       [0, 2, 1],
       [1, 0, 0]])

where the row refers to origin, first being NY, second being LA, and third being SF, and the column refers to the destination with the same order.

Thank you!

Jack

2 Answers

You can use np.unique(..., return_inverse=True) and np.add.at to do that:

import numpy as np

def comm_mtx(origin, dest, keys=None):  # keys -> np.array of strings giving the row/column order
    if keys is not None:
        o_lbl = d_lbl = keys
        k_sort = np.argsort(keys)
        # searchsorted gives positions in the sorted keys;
        # indexing k_sort maps them back to positions in keys
        o_idx = k_sort[np.searchsorted(keys, origin, sorter=k_sort)]
        d_idx = k_sort[np.searchsorted(keys, dest, sorter=k_sort)]
    else:
        # labels come back sorted; *_idx holds each item's label position
        o_lbl, o_idx = np.unique(origin, return_inverse=True)
        d_lbl, d_idx = np.unique(dest, return_inverse=True)
    out = np.zeros((o_lbl.size, d_lbl.size), dtype=int)
    np.add.at(out, (o_idx, d_idx), 1)  # unbuffered add, so repeated pairs are all counted
    if keys is not None:
        return out
    else:
        return o_lbl, d_lbl, out
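
To make the keys lookup easier to follow, here is a minimal standalone sketch of just that step, run on the question's data. It assumes every value in origin and dest appears in keys (searchsorted does not check this), and the name counts is only illustrative.

```python
import numpy as np

origin = np.array(['LA', 'SF', 'NY', 'NY', 'LA', 'LA'])
dest = np.array(['SF', 'NY', 'NY', 'SF', 'LA', 'LA'])
keys = np.array(['NY', 'LA', 'SF'])  # desired row/column order from the question

# argsort gives the permutation that sorts keys; searchsorted needs sorted input
k_sort = np.argsort(keys)

# Position of each label in the sorted keys, mapped back to its position in keys
o_idx = k_sort[np.searchsorted(keys, origin, sorter=k_sort)]
d_idx = k_sort[np.searchsorted(keys, dest, sorter=k_sort)]

counts = np.zeros((keys.size, keys.size), dtype=int)
np.add.at(counts, (o_idx, d_idx), 1)  # accumulates repeated (row, col) pairs
print(counts)
```

With keys in the order NY, LA, SF this should print the matrix requested in the question.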

Depending on the sparsity of out, you may want to use a scipy.sparse.coo_matrix instead:

import numpy as np
from scipy.sparse import coo_matrix as coo

def comm_mtx(origin, dest):
    o_lbl, o_idx = np.unique(origin, return_inverse=True)
    d_lbl, d_idx = np.unique(dest, return_inverse=True)
    # Duplicate (row, col) entries are summed when the matrix is converted, e.g. with .toarray()
    counts = coo((np.ones(origin.shape), (o_idx, d_idx)), shape=(o_lbl.size, d_lbl.size))
    return o_lbl, d_lbl, counts
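
As a rough check that the sparse version really counts repeated pairs, the sketch below builds the matrix for the question's data and converts it to a dense array. It assumes SciPy is installed; the names sparse_counts and dense_counts are only illustrative. Rows and columns follow the sorted label order that np.unique returns (LA, NY, SF), not the order asked for in the question.

```python
import numpy as np
from scipy.sparse import coo_matrix

origin = np.array(['LA', 'SF', 'NY', 'NY', 'LA', 'LA'])
dest = np.array(['SF', 'NY', 'NY', 'SF', 'LA', 'LA'])

# Sorted unique labels plus, for each item, the index of its label
o_lbl, o_idx = np.unique(origin, return_inverse=True)
d_lbl, d_idx = np.unique(dest, return_inverse=True)

# One entry of value 1 per (origin, dest) pair; duplicates are kept for now
sparse_counts = coo_matrix(
    (np.ones(origin.size, dtype=int), (o_idx, d_idx)),
    shape=(o_lbl.size, d_lbl.size),
)

# Converting to a dense array sums the duplicate entries into counts
dense_counts = sparse_counts.toarray()
print(o_lbl, d_lbl)
print(dense_counts)
```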
Daniel F
  • This answer is wrong because OP said "where the row refers to origin, first being NY, second being LA, and third being SF, and the column refers to the destination with the same order", and `np.unique` does not give you this order. – Tom Wyllie Aug 02 '17 at 10:25
  • Although if OP changes his mind and decides he actually doesn't need this then this answer is correct and better than mine :) – Tom Wyllie Aug 02 '17 at 10:26
  • Quite right, let me see if I can come up with something better than yours :P – Daniel F Aug 02 '17 at 10:33
  • Please do, I'm sure there's more num-pythonic way than using a dictionary to map keys. Also, good shout with the sparse matrix! – Tom Wyllie Aug 02 '17 at 10:35
  • The sparse matrix idea is really really cool. I decided to accept this answer because the order does not really matter to me. Thank you! – Jack Aug 02 '17 at 10:53
  • Then I'd suggest at least returning `o_lbl` and `d_lbl` so you can track which rows and columns are which. I edited the answer to reflect this. – Daniel F Aug 02 '17 at 10:55
  • Also swiped Divakar's answer [here](https://stackoverflow.com/questions/33529593/how-to-use-a-dictionary-to-translate-replace-elements-of-an-array) to create the indices without mapping through a dictionary (which should be much faster than the `np.vectorize` @TomWyllie uses, which will just make a `for` loop). – Daniel F Aug 02 '17 at 11:22

To achieve what you've asked, which is to have the output matrix with the rows corresponding to the keys in a specific order, you could use a dictionary to map each unique element to a row index.

import numpy as np

origin = np.asarray(['LA', 'SF', 'NY', 'NY', 'LA', 'LA'])
dest = np.asarray(['SF', 'NY', 'NY', 'SF', 'LA', 'LA'])

matrix_map = {'NY': 0, 'LA': 1, 'SF': 2}
stacked_inputs = np.vstack((origin, dest))
remapped_inputs = np.vectorize(matrix_map.get)(stacked_inputs)

output_matrix = np.zeros((len(matrix_map), len(matrix_map)), dtype=np.int16)
np.add.at(output_matrix, (remapped_inputs[0], remapped_inputs[1]), 1)
print(output_matrix)

This outputs:

[[1 0 1]
 [0 2 1]
 [1 0 0]]

as desired.


Alternatively, if you do not wish to hard-code matrix_map beforehand, you could build it programmatically as follows:

stacked_inputs = np.vstack((origin, dest))

matrix_map = {}
for element in stacked_inputs.flatten():
    matrix_map.setdefault(element, len(matrix_map))
print(matrix_map)

remapped_inputs = np.vectorize(matrix_map.get)(stacked_inputs)

This would not give you the desired order, but would allow you to use the dictionary to easily map which row / column relates to which token.
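
For completeness, here is a sketch of the programmatic variant run end to end on the question's data, including one possible way to read the row / column labels back out of the dictionary. The names counts and labels are only illustrative, and the resulting order is simply the order of first appearance in the flattened input.

```python
import numpy as np

origin = np.asarray(['LA', 'SF', 'NY', 'NY', 'LA', 'LA'])
dest = np.asarray(['SF', 'NY', 'NY', 'SF', 'LA', 'LA'])
stacked_inputs = np.vstack((origin, dest))

# Assign each token the next free index the first time it is seen
matrix_map = {}
for element in stacked_inputs.flatten():
    matrix_map.setdefault(element, len(matrix_map))

# Translate tokens to indices, then count each (origin, dest) pair
remapped_inputs = np.vectorize(matrix_map.get)(stacked_inputs)
counts = np.zeros((len(matrix_map), len(matrix_map)), dtype=int)
np.add.at(counts, (remapped_inputs[0], remapped_inputs[1]), 1)

# Tokens listed in row/column order, recovered from the dictionary values
labels = sorted(matrix_map, key=matrix_map.get)
print(labels)
print(counts)
```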

Tom Wyllie