I have a very large list-of-lists table, and I need to add more columns to it.

tbl = [range(200),range(200),range(200),...]
newCol = [val1, val2]

The way I see it, I can do this either with:

for idx,val in enumerate(tbl):
    tbl[idx] = newCol + val

or

colRep = [newCol]*len(tbl)
mgr = itertools.izip(colRep,tbl)
newTbl = [ itertools.chain(*elem) for elem in mgr]

Is one really better than the other? Is there a better way of doing this?

Anand S Kumar
Arthur
  • The former is more readable, IMO, but what do *you* mean by *"better"*? If it's a matter of performance, have you tried any testing/profiling? – jonrsharpe Jul 29 '15 at 17:06
  • Yes, my primary concern is performance. I did not try to test it; I thought maybe there was a theoretical Python answer that would resolve it. – Arthur Jul 29 '15 at 17:10
  • I would go with the former unless the latter was much, much faster. That said, `colRep = [newCol]*len(tbl)`, where `newCol` is a list, tends to produce ["interesting" behavior](http://stackoverflow.com/questions/240178/python-list-of-lists-changes-reflected-across-sublists-unexpectedly). – NightShadeQueen Jul 29 '15 at 17:10
  • There are other concerns, too - the former operates in-place on the outer list (but creates new inner lists), is that desirable? – jonrsharpe Jul 29 '15 at 17:13
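
A small illustration of the aliasing caveat raised in the comments, written for Python 3 (an assumption; the question's code is Python 2). The sample values are made up. In the question's zip/chain code the repeated list is only read, never mutated, so this is something to be aware of rather than a bug there.

```python
# Sketch: [newCol] * n repeats references to ONE list object, not n copies.
newCol = ['val1', 'val2']
tbl = [[0, 1], [0, 1, 2]]

colRep = [newCol] * len(tbl)
same_object = all(item is newCol for item in colRep)

# Mutating one entry is visible through every entry (and through newCol).
colRep[0].append('oops')
shared = colRep[1]

# Concatenation builds a fresh list per row, so the rows stay independent.
newCol = ['val1', 'val2']
rows = [newCol + row for row in tbl]
rows[0].append('only here')
```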

1 Answer

For readability, a simple list comprehension would do:

In [28]: tbl = [range(2),range(3),range(4)]
In [29]: [newCol + list(elt) for elt in tbl]
Out[29]: 
[['val1', 'val2', 0, 1],
 ['val1', 'val2', 0, 1, 2],
 ['val1', 'val2', 0, 1, 2, 3]]

Note that in Python 3, range returns a range object, not a list. To keep the code compatible with both Python 2 and Python 3, I changed newCol + elt to newCol + list(elt).

If you wish to modify tbl in-place, you could use

tbl[:] = [newCol + list(elt) for elt in tbl]
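
To make the difference between rebinding and slice assignment concrete, here is a minimal Python 3 sketch. The names alias, tbl2 and alias2 are hypothetical and exist only for the illustration.

```python
newCol = ['val1', 'val2']

# Rebinding: the name tbl now points at a new list; other references
# to the old outer list still see the old contents.
tbl = [range(2), range(3)]
alias = tbl
tbl = [newCol + list(elt) for elt in tbl]
rebound_alias_updated = alias is tbl

# Slice assignment: the existing outer list is refilled, so every
# reference to it sees the new rows.
tbl2 = [range(2), range(3)]
alias2 = tbl2
tbl2[:] = [newCol + list(elt) for elt in tbl2]
inplace_alias_updated = alias2 is tbl2
```

Note that "in-place" here is about the identity of the outer list, not about saving memory: the comprehension on the right-hand side is still built in full before it is assigned into the slice.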

Note that before we can compare performance, we need to pin down what the desired result is, lest we end up comparing apples to oranges.

The for-loop modifies tbl in-place. Is the in-place behavior important?

The zip/chain code does not modify tbl in-place and instead produces a list of iterators:

In [47]: newTbl
Out[47]: 
[<itertools.chain at 0x7f5aeb0a6750>,
 <itertools.chain at 0x7f5aeb0a6410>,
 <itertools.chain at 0x7f5aeb0a6310>]

That could be what you want, but it would be unfair to compare the performance of these two pieces of code, because the iterators defer the work of producing their items until they are consumed. It would be like timing the difference between painting a house and contemplating painting a house.

To make the comparison fairer, we could use list to consume each iterator:

newTbl = [ list(itertools.chain(*elem)) for elem in mgr]
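
The laziness described above can be seen in a short Python 3 sketch. It calls itertools.chain(newCol, row) directly, which for these inputs yields the same items as chain(*elem) over the zipped pairs. The variable names are illustrative.

```python
import itertools

newCol = ['val1', 'val2']
tbl = [range(2), range(3)]

# Building the chain objects does almost no work: nothing is copied yet.
lazy_rows = [itertools.chain(newCol, row) for row in tbl]

# The items are produced only when a row is consumed...
first = list(lazy_rows[0])
# ...and a chain object can be consumed only once.
again = list(lazy_rows[0])
```

Here first holds the full row, while again is empty because the iterator is already exhausted. That is another reason to materialize the rows with list if the table will be read more than once.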

To benchmark the performance of the various options, you could use timeit like this:

# Python 2 code: relies on itertools.izip and on range() returning a list
import timeit
import itertools

tbl = [range(2),range(3),range(4)]
newCol = ['val1', 'val2']

stmt = {
    'for_loop' : '''\
for idx,val in enumerate(tbl):
    tbl[idx] = newCol + val
''',
    'list_comp': '''tbl = [newCol + elt for elt in tbl]''',
    'inplace_list_comp': '''tbl[:] = [newCol + elt for elt in tbl]''',
    'zip_chain': '''
colRep = [newCol]*len(tbl)
mgr = itertools.izip(colRep,tbl)
newTbl = [ list(itertools.chain(*elem)) for elem in mgr]
'''

}
for s in ('for_loop', 'list_comp', 'inplace_list_comp', 'zip_chain'):
    t = timeit.timeit(
        stmt[s], 
        setup='from __main__ import newCol, itertools; tbl = [range(200)]*10**5',
        number=10)
    print('{:20}: {:0.2f}'.format(s, t))

yields

for_loop            : 1.12
list_comp           : 1.21
inplace_list_comp   : 1.26
zip_chain           : 4.40

So the for_loop may be marginally faster. Be sure to check this with a tbl closer to your actual use case. timeit results may differ for a number of reasons, including hardware, OS, and software versions.
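
For readers on Python 3, the benchmark could be adapted roughly as follows. This is a sketch under stated assumptions, not a rerun of the numbers above: zip replaces itertools.izip, the rows are real lists so that newCol + row works, and make_table and bench are hypothetical helpers. The table is rebuilt before every timed run, because three of the statements in the original benchmark modify or rebind tbl, so successive runs there operate on rows that have already been extended.

```python
import itertools
import timeit

newCol = ['val1', 'val2']

def make_table(n_rows=1000, n_cols=200):
    # Hypothetical helper: rows are real lists, not range objects.
    return [list(range(n_cols)) for _ in range(n_rows)]

def for_loop(tbl):
    for idx, val in enumerate(tbl):
        tbl[idx] = newCol + val
    return tbl

def list_comp(tbl):
    return [newCol + row for row in tbl]

def inplace_list_comp(tbl):
    tbl[:] = [newCol + row for row in tbl]
    return tbl

def zip_chain(tbl):
    colRep = [newCol] * len(tbl)
    return [list(itertools.chain(*elem)) for elem in zip(colRep, tbl)]

def bench(func, n_rows=1000, repeat=3):
    # Build a fresh table outside the timed region for every run, so the
    # in-place variants always start from the same input.
    times = []
    for _ in range(repeat):
        tbl = make_table(n_rows)
        timer = timeit.Timer(lambda: func(tbl))
        times.append(timer.timeit(number=1))
    return min(times)

if __name__ == '__main__':
    for func in (for_loop, list_comp, inplace_list_comp, zip_chain):
        print('{:20}: {:0.4f}'.format(func.__name__, bench(func)))
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes. Adjust n_rows and the row contents to resemble your real table.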

Also be aware that this might be premature optimization if this little piece of code is not a significant bottleneck in your actual code. For example, if your actual code spends 1.21 seconds in this list comprehension and 1000 seconds elsewhere, a tenth-of-a-second improvement here would be insignificant overall.

unutbu
  • I really like this. I only used range as an example; yes, some tables are all integers, yet others are strings. – Arthur Jul 29 '15 at 17:24