4

I need to create a dictionary from a loop that runs through an array of 2 columns of numbers. Below is a small subset of the array:

array([[  0,   1],
       [  1,   0],
       [  1,   2],
       [  2,   3],
       [  2,   1]])

I would like to create a dictionary that takes the unique numbers of the first column as keys (e.g. {0,1,2} in this example) and the corresponding numbers in the second column as values.

For this example the dictionary would look like:

dict = {0:[1], 1:[0,2], 2:[3,1]}

My array is very long (370,000 x 2) so I would like to do this through an efficient loop. Any advice would be greatly appreciated!

smci
  • 32,567
  • 20
  • 113
  • 146
yogz123
  • 703
  • 3
  • 8
  • 25

7 Answers7

6

You can use defaultdict to accomplish this.

from collections import defaultdict
a = np.array([[  0,   1],[  1,   0],[  1,   2],[  2,   3], [  2,   1]])
d = defaultdict(list)
for x,y in a:
    d[x].append(y)
Daniel
  • 5,095
  • 5
  • 35
  • 48
2

a pure numpy solution :

b=a[np.lexsort(a.T[::-1])] # if necessary.
keys,values=b.T
uniq,steps=np.unique(keys,return_index =True)
bins=np.split(values,steps[1:])

If uniq==range(len(uniq)), let it like that : bins[key] will work, and it is the fastest way.

Else :

d=dict(zip(uniq,bins))
#{0: array([1]), 1: array([0, 2]), 2: array([1, 3])}

will build your dictionary.

B. M.
  • 18,243
  • 2
  • 35
  • 54
1

Assuming your array has been sorted by the first column, you can use groupby:

from itertools import groupby
{k: [v for _, v in g] for k, g in groupby(arr, lambda x: x[0])}
# {0: [1], 1: [0, 2], 2: [3, 1]}

#arr = np.array([[  0,   1],
#                [  1,   0],
#                [  1,   2],
#                [  2,   3],
#                [  2,   1]])
Psidom
  • 209,562
  • 33
  • 339
  • 356
1

nice one-liner to do that:

import itertools

array = [[  0,   1],
       [  1,   0],
       [  1,   2],
       [  2,   3],
       [  2,   1]]

d = {k:[i[1] for i in v] for k,v in itertools.groupby(sorted(array),lambda x : x[0])}

result:

{0: [1], 1: [0, 2], 2: [1, 3]}
  • group by first items on a sorted version of the list (in case the sort isn't already done)
  • create the dictionary in a dict comprehension building the value list with only the 2nd element of the grouped items
Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
  • I'm getting the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all() for this solution". Although I'm a big fan of the one-liner! – yogz123 Feb 07 '17 at 23:04
1

If your first column is a "range with repeats"

steps_at = np.searchsorted(a[:,0], np.arange(a[-1,0]+1))
result = {k:v for k,v in zip(a[steps_at,0], np.split(a[:,1], steps_at[1:]))}

If your first column has equal items clustered but not sorted

steps_at = np.where(np.diff(np.r_[np.nan, a[:,0]]))[0]
return {k:v for k,v in zip(a[steps_at,0], np.split(a[:,1], steps_at[1:]))}

General case

ind = np.argsort(a[:, 0], kind='mergesort')
aa = a[ind, 0]
steps_at = np.where(np.diff(np.r_[np.nan, aa]))[0]
return {k:v for k,v in zip(aa[steps_at], np.split(a[ind,1], steps_at[1:]))}

Shootout:

(19, 2) correctness

Psidom               {0: [0, 28, 38, 97, 99, 65, 73], 1: [64, 91, 70, 40, 9], 2: [94, 96, 69, 46], 3: [85, 15, 65]}
Daniel_Jimenez       defaultdict(<class 'list'>, {0: [0, 28, 38, 97, 99, 65, 73], 1: [64, 91, 70, 40, 9], 2: [94, 96, 69, 46], 3: [85, 15, 65]})
Jean_Francois_Fabre  {0: [0, 28, 38, 97, 99, 65, 73], 1: [64, 91, 70, 40, 9], 2: [94, 96, 69, 46], 3: [85, 15, 65]}
Alexandre_Kempf      {0: array([ 0, 28, 38, 97, 99, 65, 73]), 1: array([64, 91, 70, 40,  9]), 2: array([94, 96, 69, 46]), 3: array([85, 15, 65])}
Or_Duan              {0: [0, 28, 38, 97, 99, 65, 73], 1: [64, 91, 70, 40, 9], 2: [94, 96, 69, 46], 3: [85, 15, 65]}
Paul_Panzer_sorted   {0: array([ 0, 28, 38, 97, 99, 65, 73]), 1: array([64, 91, 70, 40,  9]), 2: array([94, 96, 69, 46]), 3: array([85, 15, 65])}
Paul_Panzer_grouped  {0: array([ 0, 28, 38, 97, 99, 65, 73]), 1: array([64, 91, 70, 40,  9]), 2: array([94, 96, 69, 46]), 3: array([85, 15, 65])}
Paul_Panzer_general  {0: array([ 0, 28, 38, 97, 99, 65, 73]), 1: array([64, 91, 70, 40,  9]), 2: array([94, 96, 69, 46]), 3: array([85, 15, 65])}
B_M_sorted           {0: array([ 0, 28, 38, 97, 99, 65, 73]), 1: array([64, 91, 70, 40,  9]), 2: array([94, 96, 69, 46]), 3: array([85, 15, 65])}
B_M_general          {0: array([ 0, 28, 38, 65, 73, 97, 99]), 1: array([ 9, 40, 64, 70, 91]), 2: array([46, 69, 94, 96]), 3: array([15, 65, 85])}

 (40194, 2) speed (seconds used for 10 repeats)

Psidom               0.4336233548820019
Daniel_Jimenez       0.3609276609495282
Jean_Francois_Fabre  0.17962428089231253
Alexandre_Kempf      3.5392782238777727
Or_Duan              0.1873011060524732
Paul_Panzer_sorted   0.08001555898226798
Paul_Panzer_grouped  0.08144942414946854
Paul_Panzer_general  0.10183193604461849
B_M_sorted           0.09192353091202676
B_M_general          0.16612185980193317

 (400771, 2) speed (seconds used for 10 repeats)

Psidom               3.968917251098901
Daniel_Jimenez       3.619185874937102
Jean_Francois_Fabre  1.7871235068887472
Or_Duan              1.9176530800759792
Paul_Panzer_sorted   0.8291062880307436
Paul_Panzer_grouped  0.8662846579682082
Paul_Panzer_general  1.0812653130851686
B_M_sorted           1.031000167131424
B_M_general          2.16174431797117
Alexandre_Kempf      513.2718367418274

code:

from collections import defaultdict
from itertools import groupby
import numpy as np
import timeit

Psidom = lambda a: {k: [v for _, v in g] for k, g in groupby(a, lambda x: x[0])}

def Daniel_Jimenez(a):
    d = defaultdict(list)
    for x,y in a:
        d[x].append(y)
    return d

Jean_Francois_Fabre = lambda a: {k:[i[1] for i in v] for k,v in groupby(a,lambda x : x[0])}

def Alexandre_Kempf(a):
    keys = a[:,0]
    items = a[:,1]
    uniqkey = np.unique(keys)
    prelist = [items[keys==i] for i in uniqkey]
    dico = {}
    for i in np.arange(len(uniqkey)):
        dico[uniqkey[i]] = prelist[i]
    return dico

def Or_Duan(a):
    default = {}
    for elm in a:
        try:
            default[elm[0]].append(elm[1])
        except KeyError:
            default[elm[0]] = [elm[1]]
    return default

def Paul_Panzer_sorted(a):
    steps_at = np.searchsorted(a[:,0], np.arange(a[-1,0]+1))
    return {k:v for k,v in zip(a[steps_at,0], np.split(a[:,1], steps_at[1:]))}

def Paul_Panzer_grouped(a):
    steps_at = np.where(np.diff(np.r_[np.nan, a[:,0]]))[0]
    return {k:v for k,v in zip(a[steps_at,0], np.split(a[:,1], steps_at[1:]))}

def Paul_Panzer_general(a):
    ind = np.argsort(a[:, 0], kind='mergesort')
    aa = a[ind, 0]
    steps_at = np.where(np.diff(np.r_[np.nan, aa]))[0]
    return {k:v for k,v in zip(aa[steps_at], np.split(a[ind,1], steps_at[1:]))}

def B_M_sorted(b):
    keys,values=b.T
    uniq,steps=np.unique(keys,return_index =True)
    bins=np.split(values,steps[1:])
    return dict(zip(uniq,bins))

def B_M_general(a):
    b=a[np.lexsort(a.T[::-1])]
    keys,values=b.T
    uniq,steps=np.unique(keys,return_index =True)
    bins=np.split(values,steps[1:])
    return dict(zip(uniq,bins))

c = np.arange(4).repeat(np.random.randint(1,10,(4)))
d = np.random.randint(100, size=c.shape)
t = np.c_[c, d]
c = np.arange(8000).repeat(np.random.randint(1,10,(8000)))
d = np.random.randint(100, size=c.shape)
a = np.c_[c, d]
c = np.arange(80000).repeat(np.random.randint(1,10,(80000)))
d = np.random.randint(100, size=c.shape)
b = np.c_[c, d]

print(t.shape, 'correctness\n')
i = 0
for f in (Psidom, Daniel_Jimenez, Jean_Francois_Fabre, Alexandre_Kempf,
          Or_Duan, Paul_Panzer_sorted, Paul_Panzer_grouped,
          Paul_Panzer_general, B_M_sorted, B_M_general):
    name = f.__name__
    if name == '<lambda>':
        name = ['Psidom', 'Jean_Francois_Fabre'][i]
        i += 1
    print(name + (20 - len(name)) * ' ', f(t))

print('\n', a.shape, 'speed (seconds used for 10 repeats)\n')
i = 0
for f in (Psidom, Daniel_Jimenez, Jean_Francois_Fabre, Alexandre_Kempf,
          Or_Duan, Paul_Panzer_sorted, Paul_Panzer_grouped,
          Paul_Panzer_general, B_M_sorted, B_M_general):
    name = f.__name__
    if name == '<lambda>':
        name = ['Psidom', 'Jean_Francois_Fabre'][i]
        i += 1
    print(name + (20 - len(name)) * ' ', timeit.timeit("f(a)",number=10,
                                                       globals={'f':f, 'a':a}))

print('\n', b.shape, 'speed (seconds used for 10 repeats)\n')
i = 0
for f in (Psidom, Daniel_Jimenez, Jean_Francois_Fabre,
          Or_Duan, Paul_Panzer_sorted, Paul_Panzer_grouped,
          Paul_Panzer_general, B_M_sorted, B_M_general, Alexandre_Kempf):
    name = f.__name__
    if name == '<lambda>':
        name = ['Psidom', 'Jean_Francois_Fabre'][i]
        i += 1
    print(name + (20 - len(name)) * ' ', timeit.timeit("f(a)",number=10,
                                                       globals={'f':f, 'a':b}))
Paul Panzer
  • 51,835
  • 3
  • 54
  • 99
  • Good job. If the first column is a "range with repeats", you can kill the lexsort and even the dict in B_M(a) ;). – B. M. Feb 07 '17 at 21:58
  • @B.M. updated it a sec ago. Timed yours with and without sorting. Left the dict in, though, who knows what OP wants to do with it. But good insight, you can skip it. – Paul Panzer Feb 07 '17 at 22:10
0

One solution would be to do (if a is your array):

keys = a[:,0]
items = a[:,1]
uniqkey = np.unique(keys)
prelist = [items[keys==i] for i in uniqkey]
dico = {}
for i in np.arange(len(uniqkey)):
    dico[uniqkey[i]] = prelist[i]
Alexandre Kempf
  • 949
  • 7
  • 9
0

Alternative to the defaultdict is to use try except and implement the "ask forgiveness not permission" approach.

array = [[  0,   1],
           [  1,   0],
           [  1,   2],
           [  2,   3],
           [  2,   1]]
default = {}
for elm in array:
    try:
        default[elm[0]].append(elm[1])
    except KeyError:
        default[elm[0]] = [elm[1]]
Community
  • 1
  • 1
Or Duan
  • 13,142
  • 6
  • 60
  • 65
  • The whole point to using `defaultdict` is streamline this forgiveness approach, at least in appearance, if not in speed. – hpaulj Feb 07 '17 at 18:18
  • @hpaulj Say what you will, but this is roughly twice as fast as the defaultdict version. And the OP asked for speed. – Paul Panzer Feb 07 '17 at 20:03
  • @hpaulj I never heard that `defaultdict` is meant to replace the forgiveness approach, it is just different one(less lines of code in this case). But the OP asked for performance so this is prefered. No need to downvote :) – Or Duan Feb 08 '17 at 08:38
  • A question from several years about the relative advantages of defaltdict and alternatives. http://stackoverflow.com/q/20664764/901925. Relative advantage of the forgiveness approach may depend on how often it has to ask for forgiveness. – hpaulj Feb 08 '17 at 09:06