4

I have a large list of tuples e.g. [ (1,2), (1,3), (1,4), (2,1), (2,3) ] etc. I want to convert it to [ (1, [1,2,3,4]), (2, [1,3] ) ] efficiently. I am grouping tuples by the first element of each tuple i.e. (1,2), (1,3), (1,4) becomes (1, [2,3,4]) (also see the Haskell version below). I doubt that this can be done in one pass? The input list is always ordered.

In python in tried using defaultdict which I thought was the natural solution without reinventing the wheel. It works well but it does not preserve the order of keys. One solution is to use ordered defaultdict as explained here.

Anyway, I'd like to know the language independent and efficient solution to this problem. My current solution requires two passes and one call to set( ) on a list.

Update

I am thinking of implementing following Haskell version:

a = [ (1,2), (1,3), (1,4), (2,1), (2,3) ] 
b = groupBy (\ x y -> fst x == fst y ) 
b 
[[(1,2),(1,3),(1,4)],[(2,1),(2,3)]]  
map (\x -> (fst .head $ x, map snd x ) ) b 
[(1,[2,3,4]),(2,[1,3])]

Performance of answers

I implemented two answers (coldspeed and pm2ring). On moderate size lists (upto 10^4 elements) PM2 ring solution is faster; at size 10^5, both takes same time, on larger list COLDSPEED starts winning. Below are the numbers (with python3).

First column is number of entries in list, second is time taken by coldspeed and third column has time taken by pm2 ring solutions. All times are in second.

10 0.0001 0.0000
100 0.0001 0.0000
1000 0.0005 0.0001
10000 0.0044 0.0014
100000 0.0517 0.0452
1000000 0.5579 1.5249

Script is here http://github.com/dilawar/playground/raw/master/Python/so_group_tuple.py

With Ashwini optimization

PM 2Ring solution is even faster (roughly 3x - 5x) with Ashwini's suggestions.

10 4.887580871582031e-05 1.2636184692382812e-05
100 0.00010132789611816406 2.0742416381835938e-05
1000 0.0005109310150146484 0.000110626220703125
10000 0.004467487335205078 0.0009067058563232422
100000 0.05056118965148926 0.017516136169433594
1000000 0.6100358963012695 0.26450490951538086
10000000 6.092756509780884 2.8253660202026367

With PYPY

Somewhat mixed results. Last column is ratio of column 2 and column 3.

pypy so_group_tuple.py 
(10, [1.6927719116210938e-05, 3.409385681152344e-05], 0.4965034965034965)
(100, [4.601478576660156e-05, 8.296966552734375e-05], 0.5545977011494253)
(1000, [0.010258913040161133, 0.0019040107727050781], 5.388054094665665)
(10000, [0.0002448558807373047, 0.00021600723266601562], 1.1335540838852096)
(100000, [0.002658843994140625, 0.0018231868743896484], 1.45834967961292)
(1000000, [0.0833890438079834, 0.02979302406311035], 2.7989452709245284)
(10000000, [1.0556740760803223, 0.6789278984069824], 1.5549133841124023)

I am going with PM 2Ring solution since it is much faster till list size 10^5.

cs95
  • 379,657
  • 97
  • 704
  • 746
Dilawar
  • 5,438
  • 9
  • 45
  • 58
  • 2
    Please include your current solution, and clarify what the problem is - it's not clear to me how you're getting from your first to your second list. – perigon Aug 03 '17 at 06:31
  • [OrderedDict](https://docs.python.org/2/library/collections.html#ordereddict-objects)? – 101 Aug 03 '17 at 06:34
  • Is the input list always ordered like that? BTW, you have a typo in that list. – PM 2Ring Aug 03 '17 at 06:34
  • Is your expected output actually `[ (1, [2,3,4]), (2, [1,3] ) ]`? I Don't know from where the `1` from the list in the first tuple comes from. – Paco H. Aug 03 '17 at 06:37
  • 1
    Thanks for adding that timing info. You should take a look at [timeit](https://docs.python.org/3/library/timeit.html), it's a little more accurate (and convenient) than doing it manually with the `time` module. – PM 2Ring Aug 03 '17 at 08:05
  • Now that dicts keep insertion order, would defaultdict work here now? – PaulMcG Dec 14 '22 at 14:55

3 Answers3

15

You can do this with itertools.groupby, and using zip to rearrange the data from the collected groups:

from itertools import groupby
from operator import itemgetter

a = [(1, 2), (1, 3), (1, 4), (2, 1), (2, 3)]
b = [(k, list(list(zip(*g))[1])) for k, g in groupby(a, itemgetter(0))]
print(b)

output

[(1, [2, 3, 4]), (2, [1, 3])]

That list comp is a bit dense. Here's a variation using a traditional for loop that prints an intermediate result to make it a little easier to see what's going on.

b = []
for k, g in groupby(a, itemgetter(0)):
    t = list(zip(*g))
    print(t)
    b.append(list(t[1]))

print('Output', b)

output

[(1, 1, 1), (2, 3, 4)]
[(2, 2), (1, 3)]
Output [[2, 3, 4], [1, 3]]

As Ashwini Chaudhary mentions in the comments, nesting another list comp in there makes the code much more readable, it's probably also more efficient, since it avoids a couple of calls.

b = [(k, [x for _, x in g]) for k, g in groupby(a, itemgetter(0))]
PM 2Ring
  • 54,345
  • 6
  • 82
  • 182
8

You may use the collections.OrderedDict (import collections first):

o = collections.OrderedDict()

for x in t:
    o.setdefault(x[0], []).append(x[1])  

Now, convert o.items() to a list:

list(o.items())
# [(1, [2, 3, 4]), (2, [1, 3])]
cs95
  • 379,657
  • 97
  • 704
  • 746
  • Though this is easy to read, it is somewhat slower than `PM 2Ring` solutions on list of size upto 10^5 - 10^6. I've added some benchmark in question body. – Dilawar Aug 03 '17 at 08:02
  • 1
    @Dilawar Performance isn't the only consideration. If you want speed use C ;) You should go with what is the most simplest, clearest, easiest to read and understand. Understandably PM 2Ring's solution works and is nice to look at, but I would want to _actually_ know what my code is doing. In the end it's up to you. Cheers. – cs95 Aug 03 '17 at 22:58
1

May be if the input list is already ordered, it is not required to use any other ordering function or feature to again reorder the list. Below code will automatically give the output as you've shown.

mylistarr = ((1, 2), (1, 3), (1, 4), (2, 1), (2, 3))
output = dict()
for tuple in mylistarr:
    if tuple[0] not in anotherlist:
        output[tuple[0]] = list()
        output[tuple[0]].append(tuple[0])
    output[tuple[0]].append(tuple[1])
print output

Output: {1: [1, 2, 3, 4], 2: [2, 1, 3]}

Milind Gokhale
  • 575
  • 2
  • 14