
I need to process a lot of data held in lists, so I have been looking at the best way to do this in Python.

The main ways I've come up with are using:

- list comprehensions
- generator expressions
- functional-style operations (map, filter, etc.)

I know list comprehensions are generally considered the most "Pythonic" approach, but which is best in terms of performance?

Jim Jeffries
  • That depends on what your problem is of course. – KillianDS Jun 19 '10 at 12:13
  • Sorry, that was a bit general, looking back at it. I'll be doing lots of different things: filtering on various attributes, checking the formats of a list of strings against a regex, and calling various functions on each item in the list. Sorry it's a bit vague. – Jim Jeffries Jun 19 '10 at 12:21
  • and the answers are going to be different for the different things you want to do. Did you have any more specific questions? – Duncan Jun 19 '10 at 13:59
  • Why not build simple prototypes for each problem, using the various techniques you have listed, and then do timing tests on (a subset of the) real data? Then answer your own question here, with results and sample data :) Btw., if you can avoid backtracking in your regexps that's generally faster (well, theoretically always faster...). Also try to store and compile the regexps outside the loops etc. – kaleissin Jun 19 '10 at 17:28

1 Answer

Inspired by this answer, Python List Comprehension Vs. Map, I've tweaked its benchmarks so that generator expressions can be compared as well:

For built-ins:

$ python -mtimeit -s 'import math;xs=range(10)' 'sum(map(math.sqrt, xs))'
100000 loops, best of 3: 2.96 usec per loop
$ python -mtimeit -s 'import math;xs=range(10)' 'sum([math.sqrt(x) for x in xs])'
100000 loops, best of 3: 3.75 usec per loop
$ python -mtimeit -s 'import math;xs=range(10)' 'sum(math.sqrt(x) for x in xs)'
100000 loops, best of 3: 3.71 usec per loop

For lambdas:

$ python -mtimeit -s'xs=range(10)' 'sum(map(lambda x: x+2, xs))'
100000 loops, best of 3: 2.98 usec per loop
$ python -mtimeit -s'xs=range(10)' 'sum([x+2 for x in xs])'
100000 loops, best of 3: 1.66 usec per loop
$ python -mtimeit -s'xs=range(10)' 'sum(x+2 for x in xs)'
100000 loops, best of 3: 1.48 usec per loop

Making a list:

$ python -mtimeit -s'xs=range(10)' 'list(map(lambda x: x+2, xs))'
100000 loops, best of 3: 3.19 usec per loop
$ python -mtimeit -s'xs=range(10)' '[x+2 for x in xs]'
100000 loops, best of 3: 1.21 usec per loop
$ python -mtimeit -s'xs=range(10)' 'list(x+2 for x in xs)'
100000 loops, best of 3: 3.36 usec per loop
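
The command-line runs above can also be reproduced from a script, which is handy when you want to swap in your own data. The following is only a minimal sketch using the standard `timeit` module: the `candidates` dictionary and the `best_time` helper are hypothetical names introduced here, and wrapping each statement in a lambda adds a small, constant call overhead to every measurement.

```python
import math
import timeit

xs = list(range(10))  # same tiny input as the command-line runs

# Hypothetical labels mapped to the expressions being compared.
candidates = {
    "map + builtin": lambda: sum(map(math.sqrt, xs)),
    "list comprehension": lambda: sum([math.sqrt(x) for x in xs]),
    "generator expression": lambda: sum(math.sqrt(x) for x in xs),
}


def best_time(func, number=10000, repeat=3):
    """Return the best per-call time in microseconds over `repeat` runs."""
    runs = timeit.repeat(func, number=number, repeat=repeat)
    return min(runs) / number * 1e6


results = {name: best_time(func) for name, func in candidates.items()}
for name, usec in results.items():
    print("%-22s %.2f usec per loop" % (name, usec))
```

Absolute numbers will differ by machine and Python version, so compare the variants against each other rather than against the figures quoted here.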

It appears that map is fastest when paired with a built-in function. Otherwise, generator expressions narrowly beat list comprehensions when the results are only being consumed (here, by sum), while a list comprehension is clearly fastest when you actually need a list. Besides slightly cleaner syntax, generator expressions also use much less memory than list comprehensions because they are evaluated lazily. So, in the absence of specific tests for your application, use map with built-ins, a list comprehension when you require a list result, and a generator expression otherwise. Bear in mind that these timings are for 10-element inputs. If you're really concerned with performance, check whether you actually need lists at every point in your program.
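
To illustrate the memory point, here is a rough sketch that compares the size of a list comprehension's result with that of the equivalent generator object. The input size `n` and the variable names are assumptions made for the example, and `sys.getsizeof` reports only the container itself, not the integers it refers to.

```python
import sys

n = 100000  # hypothetical input size, chosen only for illustration

squares_list = [x * x for x in range(n)]  # all n results are stored at once
squares_gen = (x * x for x in range(n))   # results are produced one at a time

list_size = sys.getsizeof(squares_list)  # grows with n
gen_size = sys.getsizeof(squares_gen)    # small and independent of n

print("list: %d bytes, generator: %d bytes" % (list_size, gen_size))

# Both forms yield the same values when consumed.
total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)
```

The trade-off is that a generator can be consumed only once: after the `sum` call above, `squares_gen` is exhausted, whereas `squares_list` can be iterated again.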

Community
gilesc