7

I have a list of dictionaries like this:

 data = [{'x': 1, 'y': 10},
         {'x': 3, 'y': 15},
         {'x': 2, 'y': 1},
          ... ]

I have a function (for example matplotlib.axis.plot) which needs lists of x and y values. So I have to "transpose" the dictionary".

First question: what do you call this operation? Is "transpose" the correct term?

I've tried this, but I'm searching for an efficient way (maybe there are some special numpy function):

x = range(100)
y = reversed(range(100))
d = [dict((('x',xx), ('y', yy))) for (xx, yy) in zip(x,y)]
# d is [{'y': 99, 'x': 0}, {'y': 98, 'x': 1}, ... ]

timeit.Timer("[dd['x'] for dd in d]", "from __main__ import d").timeit()
# 6.803985118865967

from operator import itemgetter
timeit.Timer("map(itemgetter('x'), d)", "from __main__ import d, itemgetter").timeit()
# 7.322326898574829

timeit.Timer("map(f, d)", "from __main__ import d, itemgetter; f=itemgetter('x')").timeit()
# 7.098556041717529

# quite dangerous
timeit.Timer("[dd.values()[1] for dd in d]", "from __main__ import d").timeit()
# 19.358459949493408

Is there a better solution? My doubt is: in these cases the hash of the string 'x'is recomputed every time?

hobs
  • 18,473
  • 10
  • 83
  • 106
Ruggero Turra
  • 16,929
  • 16
  • 85
  • 141
  • 1
    Should that be `from operator import itemgetter` ? – mgilson Sep 06 '12 at 13:17
  • Under normal circumstances, string hashes are calculated only once and stored in [`s->ob_shash`](http://hg.python.org/cpython/file/5a2ef447b80d/Objects/stringobject.c#l1258) for reuse. – senderle Sep 06 '12 at 13:20
  • Have you considered changing to a dictionary of lists instead of a list of dictionaries? Then your data access would be `d['x'][idx]` instead of `d[idx]['x']`, but that doesn't seem too difficult to switch ... (it will also probably be more memory efficient) – mgilson Sep 06 '12 at 13:21
  • @mgilson: I can't, it comes from the output of a database query – Ruggero Turra Sep 06 '12 at 13:57
  • Can you change the database query? – grieve Sep 06 '12 at 15:32
  • Also keep in mind that timeit runs each operation one million times, so even the slowest one is only taking 0.02 milliseconds per operation – grieve Sep 06 '12 at 15:38
  • @grieve: I'm using an API to access to data on a remote database, so no, I can't – Ruggero Turra Sep 06 '12 at 16:13

3 Answers3

1

Stealing the form from this answer

import timeit
from operator import itemgetter
from itertools import imap

x = range(100)
y = reversed(range(100))
d = [dict((('x',xx), ('y', yy))) for (xx, yy) in zip(x,y)]
# d is [{'y': 99, 'x': 0}, {'y': 98, 'x': 1}, ... ]
D={x:y for x,y in zip(range(10),reversed(range(10)))}


def test_list_comp(d):
    return [dd['x'] for dd in d]

def test_list_comp_v2(d):
    return [(x["x"], x["y"]) for x in d]

def testD_keys_values(d):
    return d.keys()

def test_map(d):
    return map(itemgetter('x'), d)

def test_positional(d):
    return [dd.values()[1] for dd in d]

def test_lambda(d):
    return list(imap(lambda x: x['x'], d))

def test_imap_iter(d):
    return list(imap(itemgetter('x'), d))

for test in sorted(globals()):
    if test.startswith("test_"):
        print "%30s : %s" % (test, timeit.Timer("f(d)", 
              "from __main__ import %s as f, d" % test).timeit())
for test in sorted(globals()):
    if test.startswith("testD_"):
        print "%30s : %s" % (test, timeit.Timer("f(D)", 
              "from __main__ import %s as f, D" % test).timeit())

Gives the following results:

    test_imap_iter : 8.98246016151
       test_lambda : 15.028239837
    test_list_comp : 5.53205787458
 test_list_comp_v2 : 12.1928668102
          test_map : 6.38402269826
   test_positional : 20.2046790578
 testD_keys_values : 0.305969839705

Clearly the biggest win is getting your data in a format closer to what you already need, but you may not control that.

As far as the name goes I would call it a transformation.

Community
  • 1
  • 1
grieve
  • 13,220
  • 10
  • 49
  • 61
  • why did you add `testD_keys_values`? I'm surprised for the difference between `test_map` and `test_list_comp` – Ruggero Turra Sep 06 '12 at 16:12
  • @wiso: Re testD_keys_values: Someone posted that as an answer, and then deleted it. I would guess test_map is slower, because itemgetter is returning a function (or technically a callable) that is slightly more expensive than getting the attribute directly. – grieve Sep 06 '12 at 16:17
0

If you just need to iterate over the values you can consider this method:

imap(lambda x: x['x'], d)
zenpoy
  • 19,490
  • 9
  • 60
  • 87
  • I doubt that this will perform better than any of the methods OP has already described ... – mgilson Sep 06 '12 at 13:25
  • it seems you're the winner: 0.2664310932159424 – Ruggero Turra Sep 06 '12 at 13:31
  • 1
    @wiso: I don't think that 0.27s value is a fair comparison. That's the time to construct the imap object, not the time to actually extract all the data. If you wrap it in a `list` to make it equivalent, it's pretty slow. – DSM Sep 06 '12 at 13:39
  • @wiso I didn't measure it before posting, I just thought you might gain something memory-wise by using itertools. – zenpoy Sep 06 '12 at 13:48
  • 2
    @wiso: a lot of that is `lambda` overhead. If you use `itemgetter('x')` instead it should be closer to the others (but still probably slower.) – DSM Sep 06 '12 at 13:55
  • @DSM: it's still slower, if I substitute `lambda` with `itemgetter` I get 9.2505 – Ruggero Turra Sep 06 '12 at 13:59
  • @zenpoy , I am getting `name 'imap' is not defined`. Something missing in my installation? – wander95 Aug 18 '16 at 13:40
  • @wander95 , just import the function 'from itertools import imap' – zenpoy Aug 18 '16 at 18:17
0

Why not something like this?

[(x["x"], x["y"]) for x in d]

that will return a list of tuples containing the x and y position. I'm not sure about the speed of it, but it will get rid of the lambda overhead.

matts1
  • 857
  • 1
  • 9
  • 20