3

I have several sorted lists, and I want to add them together into one big sorted list. What is the most efficient way to do this?

Here is what I would do, but it is too inefficient:

big_list=[]
for slist in sorted_lists: # sorted_lists is a generator, so lists have to be added one by one
    big_list.extend(slist)
big_list.sort()

Here is an example for sorted_lists:

The size of sorted_lists =200

Size of first element of sorted_lists=1668

sorted_lists=[
['000008.htm_181_0040_0009', '000008.htm_181_0040_0037', '000008.htm_201_0041_0031', '000008.htm_213_0029_0004', '000008.htm_263_0015_0011', '000018.htm_116_0071_0002', '000018.htm_147_0046_0002', '000018.htm_153_0038_0015', '000018.htm_160_0060_0001', '000018.htm_205_0016_0002', '000031.htm_4_0003_0001', '000032.htm_4_0003_0001', '000065.htm_5_0013_0005', '000065.htm_8_0008_0006', '000065.htm_14_0038_0036', '000065.htm_127_0016_0006', '000065.htm_168_0111_0056', '000072.htm_97_0016_0012', '000072.htm_175_0028_0020', '000072.htm_188_0035_0004'….],
['000018.htm_68_0039_0030', '000018.htm_173_0038_0029', '000018.htm_179_0042_0040', '000018.htm_180_0054_0021', '000018.htm_180_0054_0031', '000018.htm_182_0025_0023', '000018.htm_191_0041_0010', '000065.htm_5_0013_0007', '000072.htm_11_0008_0002', '000072.htm_14_0015_0002', '000072.htm_75_0040_0021', '000079.htm_11_0005_0000', '000079.htm_14_0006_0000', '000079.htm_16_0054_0006', '000079.htm_61_0018_0012', '000079.htm_154_0027_0011', '000086.htm_8_0003_0000', '000086.htm_9_0030_0005', '000086.htm_11_0038_0004', '000086.htm_34_0031_0024'….],
['000001.htm_13_0037_0004', '000008.htm_48_0025_0006', '000008.htm_68_0025_0008', '000008.htm_73_0024_0014', '000008.htm_122_0034_0026', '000008.htm_124_0016_0005', '000008.htm_144_0046_0030', '000059.htm_99_0022_0012', '000065.htm_69_0045_0017', '000065.htm_383_0026_0020', '000072.htm_164_0030_0002', '000079.htm_122_0030_0009', '000079.htm_123_0049_0015', '000086.htm_13_0037_0004', '000109.htm_71_0054_0029', '000109.htm_73_0035_0005', '000109.htm_75_0018_0004', '000109.htm_76_0027_0013', '000109.htm_101_0030_0008', '000109.htm_134_0036_0030']]

EDIT

Thank you for the answers. I think I should have made it more clear that I do not have the sorted lists simulateneosly but I am iterating over some large files to get them. So, I need to add them one by one, as I am showing in my crude code above.

hmghaly
  • 1,411
  • 3
  • 29
  • 47
  • Have a look to this answer :) http://stackoverflow.com/a/1158196/2936488 – Nicolas Brugneaux Oct 30 '13 at 15:33
  • possible duplicate of [Combining two sorted lists in Python](http://stackoverflow.com/questions/464342/combining-two-sorted-lists-in-python) – Robᵩ Oct 30 '13 at 15:49
  • I don't think it is duplicate, please have a look at the edit – hmghaly Oct 30 '13 at 16:08
  • 1
    If you have described the constraint correctly, you already have the optimal solution. That is, if you can't have all of the lists (or at least all of the lists' generators) active simultaneously, then the best you can do is put everything into one big pile and sort it. – Robᵩ Oct 30 '13 at 16:50

3 Answers3

6

The standard library provides heapq.merge for this purpose:

>>> a=[1,3,5,6]
>>> b=[2,4,6,8]
>>> c=[2.5,4.5]
>>> list(heapq.merge(a,b,c))
[1, 2, 2.5, 3, 4, 4.5, 5, 6, 6, 8]
>>> 

Or, in your case:

big_list = list(heapq.merge(*sorted_lists))

Note that you don't have to create the list, since heapq.merge returns an iterable:

for item in heapq.merge(*sorted_lists):

Quoting the doc:

Similar to sorted(itertools.chain(*iterables)) but returns an iterable, does not pull the data into memory all at once, and assumes that each of the input streams is already sorted (smallest to largest).

Robᵩ
  • 163,533
  • 20
  • 239
  • 308
  • 1
    Thanks for the extra info. `heapq.merge` doesn't require all of the input lists to be in memory simultaneously, but it does require that all of the input iterators to exist simultaneously. So: `heapq.merge(open("filename{}".format(i) for i in range(200))` will open all two hundred files simultaneously, won't read all of any particular file simultaneously. – Robᵩ Oct 30 '13 at 16:19
3

Use the heapq module to track which list to pick the next sorted value from:

import heapq

def merge(*iterables):
    h = []
    for it in map(iter, iterables):
        try:
            next = it.next
            h.append([next(), next])
        except StopIteration:
            pass
    heapq.heapify(h)

    while True:
        try:
            while True:
                v, next = s = h[0]
                yield v
                s[0] = next()
                heapq._siftup(h, 0)
        except StopIteration:
            heapq.heappop(h)
        except IndexError:
            return

This pushes all lists unto a heap, kept sorted by their next value. Every time this yields the lowest value, the heap is updated with the next value from the iterable used and reorders the heap again.

This in essence keeps a list of [next_value, iterable] lists, and these are sorted efficiently by next_value.

Usage:

for value in merge(*sorted_lists):
    # loops over all values in `sorted_lists` in sorted order

or

big_list = list(merge(*sorted_lists))

to create a new big list with all values sorted, efficiently.

This exact implementation was added to the heapq module as the heapq.merge() function so you can just do:

from heapq import merge

big_list = list(merge(*sorted_lists))
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Hi I'm trying this now, what's h_append in this line: h_append([next(), next]) – hmghaly Oct 30 '13 at 15:45
  • @hmghaly: that was a typo; should be `h.append`; already corrected. – Martijn Pieters Oct 30 '13 at 15:46
  • @hmghaly: Files are iterators too; as you can see from the `for value in merge(...)` loop, you can produce values as you go along, only needing to hold the next values for each iterator in memory. – Martijn Pieters Oct 30 '13 at 17:18
  • @hmghaly: If you need to pre-process the lines from files to make them sortable, I wrote a [iterator merge function](http://stackoverflow.com/a/14465236) that takes a key function to produce sortable values. – Martijn Pieters Oct 30 '13 at 17:20
0
def merge_lists(*args):
   new_list = sorted(list(heapq.merge(*args)))
   print(new_list)
DocJ457
  • 797
  • 7
  • 17