1

I have a text file with more than 670,000 lines that I need to process. Each line has the format:

uid, a, b, c, d, x, y, x1, y1, t, 0,

I did some cleaning and converted each line to a list:

[uid,(x,y,t)]

My question is: how can I efficiently merge the (x,y,t) tuples from different lists that share the same uid?

For example: I have multiple lists

[uid1,(x1,y1,t1)]
[uid1,(x2,y2,t2)]
[uid2,(x3,y3,t3)]
[uid3,(x4,y4,t4)]
[uid2,(x5,y5,t5)]
......

And I want to transform them into:

[uid1,(x1,y1,t1), (x2,y2,t2)]
[uid2,(x3,y3,t3), (x5,y5,t5)]
[uid3,(x4,y4,t4)]
......

Any help would be really appreciated.

Sayse
Kakaa
    Could you share the code which you have tried to solve your issue? Please read the following rules to ask a question: https://stackoverflow.com/help/how-to-ask and https://stackoverflow.com/help/minimal-reproducible-example – milanbalazs Aug 15 '19 at 10:27

4 Answers

1

You can use groupby from itertools. Assuming your original lists are in a variable called lists:

from itertools import groupby

# groupby only groups consecutive items, so sort by uid first
lists = sorted(lists, key=lambda x: x[0])
grouped_list = [(uid, [item[1] for item in group])
                for uid, group in groupby(lists, key=lambda x: x[0])]
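A minimal sketch of the same idea that produces the flat [uid, tuple, tuple, ...] shape asked for in the question. The sample data and the names records and merged are made up for illustration. Sorting on the uid alone keeps the tuples of each uid in their original order, because Python's sort is stable.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical sample data in the [uid, (x, y, t)] shape from the question.
records = [
    ["uid1", (1, 2, 10)],
    ["uid1", (3, 4, 11)],
    ["uid2", (5, 6, 12)],
    ["uid3", (7, 8, 13)],
    ["uid2", (9, 0, 14)],
]

get_uid = itemgetter(0)

# groupby only merges consecutive items with equal keys, so sort by uid first.
records_sorted = sorted(records, key=get_uid)

# Build one list per uid: the uid followed by all of its tuples.
merged = [
    [uid] + [rec[1] for rec in group]
    for uid, group in groupby(records_sorted, key=get_uid)
]

for row in merged:
    print(row)
```

The sort makes this O(n log n), so for very large inputs a single-pass dictionary approach may be cheaper.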
ivallesp
1

Just use a defaultdict.

import collections

def group_items(items):
    grouped_dict = collections.defaultdict(list)
    for item in items:
        uid = item[0]
        t = item[1]
        grouped_dict[uid].append(t)

    grouped_list = []
    for uid, tuples in grouped_dict.items():  # .iteritems() exists only in Python 2
        grouped_list.append([uid] + tuples)

    return grouped_list

Here items is the list of your initial [uid, (x, y, t)] lists, and the returned grouped_list contains one list per uid: the uid followed by all of its tuples.
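For a file of this size, the grouping can also happen while reading, so the intermediate list of [uid, (x, y, t)] pairs never has to be built. A rough sketch, assuming the comma-separated layout from the question (uid in field 0, x in field 5, y in field 6, t in field 9) and numeric x, y, t. The function name group_lines and the sample rows are hypothetical.

```python
import collections
import io


def group_lines(lines):
    """Group (x, y, t) tuples by uid while streaming over raw text lines."""
    grouped = collections.defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        uid = fields[0]
        # Assumed positions, read off "uid, a, b, c, d, x, y, x1, y1, t, 0,":
        # x is field 5, y is field 6, t is field 9.
        x, y, t = float(fields[5]), float(fields[6]), float(fields[9])
        grouped[uid].append((x, y, t))
    return grouped


# Made-up rows standing in for the real file; pass an open file object in practice.
sample = io.StringIO(
    "u1, a, b, c, d, 1.0, 2.0, 0, 0, 10, 0,\n"
    "u2, a, b, c, d, 3.0, 4.0, 0, 0, 11, 0,\n"
    "u1, a, b, c, d, 5.0, 6.0, 0, 0, 12, 0,\n"
)
grouped = group_lines(sample)

# Flatten to the [uid, tuple, tuple, ...] shape from the question.
result = [[uid] + tuples for uid, tuples in grouped.items()]
print(result)
```

This is a single pass over the input with no sort, which is usually the cheaper option when only the grouping matters and not the order of the uids.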

yaswanth
1

If your data is stored in a pandas DataFrame, you can use .groupby to group by 'uid'. If you first wrap each value (x,y,t) in a one-element tuple ((x,y,t),), you can then .sum them, because summing tuples concatenates them.

Here's an example:

import pandas as pd

df = pd.DataFrame.from_records(
    [['a',(1,2,3)],
    ['b',(1,2,3)],
    ['a',(10,9,8)]], columns = ['uid', 'foo']
)

df.apply({'uid': lambda x: x, 'foo': lambda x: (x,)}).groupby('uid').sum()

On my end, it produced:

uid foo
a   ((1, 2, 3), (10, 9, 8))
b   ((1, 2, 3),)
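If a list per uid is acceptable instead of a tuple of tuples, a possibly simpler variant is to aggregate the column with list directly. A small sketch on the same made-up data, assuming pandas is installed; the name grouped is only for illustration.

```python
import pandas as pd

df = pd.DataFrame.from_records(
    [["a", (1, 2, 3)],
     ["b", (1, 2, 3)],
     ["a", (10, 9, 8)]],
    columns=["uid", "foo"],
)

# Collect every tuple that shares a uid into one list per group.
grouped = df.groupby("uid")["foo"].agg(list)
print(grouped.to_dict())
```

The result is a Series indexed by uid, so grouped["a"] gives the list of tuples recorded for that uid.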
Itamar Mushkin
  • As a side note - please note that my solution example included a (minimal) data example that allows to demonstrate the problem and the desired result. It should've been added in the question, especially in a Python pandas question. To learn more, please visit the tour (https://stackoverflow.com/tour), how to ask (https://stackoverflow.com/help/how-to-ask) and how to ask a pandas question (https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Itamar Mushkin Aug 15 '19 at 10:33
0

How about using defaultdict, like this:

L = [['uid1',(x1,y1,t1)],
        ['uid1',(x2,y2,t2)],
        ['uid2',(x3,y3,t3)],
        ['uid3',(x4,y4,t4)],
        ['uid2',(x5,y5,t5)]]


from collections import defaultdict

dd = defaultdict(list)

for i in L:
    dd[i[0]].append(i[1])

Output of print(dd):

defaultdict(list,
            {'uid1': [(x1, y1, t1), (x2, y2, t2)],
             'uid2': [(x3, y3, t3), (x5, y5, t5)],
             'uid3': [(x4, y4, t4)]})
A. Nadjar