Record first occurrence of each item of a sublist in a list of lists

Question

I want to compare the entries of a list of lists in which each sublist contains two strings (an ID and a timestamp) and a sub-list of members. I have the following list of lists:

node = [['1001', '2008-01-06T02:12:13Z', ['']], 
        ['1002', '2008-01-06T02:13:55Z', ['']],  
        ['1003', '2008-01-06T02:13:00Z', ['Lion', 'Rhinoceros', 'Leopard', 'Panda']], 
        ['1004', '2008-01-06T02:15:20Z', ['Lion', 'Leopard', 'Eagle', 'Panda', 'Tiger']], 
        ['1005', '2008-01-06T02:15:48Z', ['Lion', 'Panda', 'Cheetah', 'Goat', 'Tiger']], 
        ['1006', '2008-01-06T02:13:30Z', ['']], 
        ['1007', '2008-01-06T02:13:38Z', ['Cheetah', 'Tiger', 'Goat']]]

I want to create a new list of lists recording the first occurrence of each member together with its ID. The result should look like this:

output = [['1001', ''], ['1003', 'Lion'], ['1003', 'Rhinoceros'], ['1003', 'Leopard'], 
          ['1003', 'Panda'], ['1004', 'Eagle'], ['1004', 'Tiger'], ['1005', 'Cheetah'],
          ['1005', 'Goat']]

I tried the following code, but it keeps running and freezes my computer. I have to restart the machine to recover.

output= []
# Add the first id and member
for elements in node[0][2]:
    output.append([node[0][0], elements])

for items in node[1:]:
    for member in items[2]:
        for root in output:
            if member not in root:
                output.append([items[0], member])

Appreciate any help and thanks in advance.

Because its member ' ' is already present in '1001'. I want to record the first occurrences of members only. — Hashmi, Jan 07 '18 at 23:13

Paul Rooney · Accepted Answer · 2018-01-07T23:29:06.443

2

Just loop over the list, maintain a set of animals that have already been seen, and only add an animal when it hasn't been seen before.

Basic code:

result = []
seenanimals = set()
for ident, _, animals in node: 
    for a in animals:
        if a not in seenanimals:
            result.append([ident, a])
            seenanimals.add(a)

print(result)


edited Jan 07 '18 at 23:29

answered Jan 07 '18 at 23:23



Ah, a set is of course faster, when you look up previously encountered elements. +1 – Mr. T Jan 07 '18 at 23:28
This won't work if the list isn't consumed in order of `ID`. – PMende Jan 07 '18 at 23:37
@PMende why wouldn't it be? – Paul Rooney Jan 07 '18 at 23:39
@PaulRooney Who's to say? I don't know anything about the data importing/processing pipeline. It's simply something to be aware of. – PMende Jan 07 '18 at 23:41
@PMende If the OP has to consider that he can add it to his question. Otherwise I don't see any reason to be concerned about it. – Paul Rooney Jan 07 '18 at 23:45
Thanks for that, Paul. My data is all in the same format. If the data gets bigger, would you recommend the set over the simple list method described in the other answers here? – Hashmi Jan 08 '18 at 00:03

If the number of animal types is small and the number of items you have to process is smallish, then you might not see a big difference between `set` and `list`. I'd say the set version is more future-proof, but if you measure and find that the list version performs better for you right now, then there won't be any harm in using it. – Paul Rooney Jan 08 '18 at 00:24
One more thing to clarify in the above code, @PaulRooney: can you explain the use of `for ident, _, animals in node:`? Why did you use `_` there? Thanks in advance – Hashmi Jan 08 '18 at 22:04
The `_` just means a variable you don't use. You can see a more complete answer [here](https://stackoverflow.com/questions/5893163/what-is-the-purpose-of-the-single-underscore-variable-in-python). `for ident, _, animals in node:` is [tuple unpacking](https://stackoverflow.com/questions/10867882/tuple-unpacking-in-for-loops); it's a neater alternative to using numeric indices. What's clearer: `animals` or `item[2]`? – Paul Rooney Jan 08 '18 at 22:17
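
The ordering concern raised in the comments above could be handled by sorting before the loop. This is only a minimal sketch, assuming that every ID is a string of digits and that "first" means "lowest ID"; the `first_occurrences` name and the sample `rows` are hypothetical and not part of the question. It also uses the `ident, _, animals` unpacking discussed above.

```python
def first_occurrences(rows):
    """Return [id, member] pairs for the lowest-ID row containing each member."""
    result = []
    seen = set()
    # Sort a copy by numeric ID so that arrival order does not matter.
    # Assumption: every ID is a string of digits.
    for ident, _, animals in sorted(rows, key=lambda row: int(row[0])):
        for animal in animals:
            if animal not in seen:
                seen.add(animal)
                result.append([ident, animal])
    return result


# Hypothetical rows, deliberately out of ID order.
rows = [
    ['1004', '2008-01-06T02:15:20Z', ['Lion', 'Eagle']],
    ['1003', '2008-01-06T02:13:00Z', ['Lion', 'Panda']],
]
print(first_occurrences(rows))
# [['1003', 'Lion'], ['1003', 'Panda'], ['1004', 'Eagle']]
```

If "first" should instead mean "earliest timestamp", the sort key would have to use `row[1]`; the question does not say which is intended.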

Mr. T · Answer 2 · 2018-01-07T23:42:16.577

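You modify the list output while iterating over it. Don't do this: the inner loop also visits the entries it has just appended, so the list keeps growing as you walk through it.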

Probably not the most elegant way, but it works, as long as there is at least one element in the list for each ID:

node = [['1001', '2008-01-06T02:12:13Z', ['']], 
        ['1002', '2008-01-06T02:13:55Z', ['']],  
        ['1003', '2008-01-06T02:13:00Z', ['Lion', 'Rhinoceros', 'Leopard', 'Panda']], 
        ['1004', '2008-01-06T02:15:20Z', ['Lion', 'Leopard', 'Eagle', 'Panda', 'Tiger']], 
        ['1005', '2008-01-06T02:15:48Z', ['Lion', 'Panda', 'Cheetah', 'Goat', 'Tiger']], 
        ['1006', '2008-01-06T02:13:30Z', ['']], 
        ['1007', '2008-01-06T02:13:38Z', ['Cheetah', 'Tiger', 'Goat']]]

output = []
unique = []
for l in node:
    for item in l[2]:
        if item not in unique:
            output.append([l[0], item])
            unique.append(item)

print(output)
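
To illustrate the warning at the top of this answer, here is a small sketch of the question's loop pattern on hypothetical two-row data (the function name and the rows are made up for the demonstration). Each new member is appended once for every existing entry that does not contain it, so the list roughly doubles with every new member, which would explain why the original code appears to hang on larger data.

```python
def buggy_first_occurrences(rows):
    """The question's loop pattern: appends to `output` while iterating over it."""
    output = [[rows[0][0], member] for member in rows[0][2]]
    for ident, _, members in rows[1:]:
        for member in members:
            for root in output:  # also visits entries appended below
                if member not in root:
                    output.append([ident, member])
    return output


# Hypothetical data: one member in the first row, three new ones in the second.
rows = [
    ['1', 't', ['a']],
    ['2', 't', ['b', 'c', 'd']],
]
print(len(buggy_first_occurrences(rows)))
# 8
```

Four distinct members produce eight entries here (1, 2, 4, 8), instead of the four that were wanted.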

score 1 · Answer 3 · answered Jan 07 '18 at 23:36

I would iterate through the main list first in this way:

item_id_dict = {}
for sublist in node:
    for item in sublist[2]:
        if item not in item_id_dict:
            item_id_dict[item] = []
        item_id_dict[item].append(sublist[0])

If you want to avoid the `if item not in item_id_dict` check, you can simply use a `defaultdict` from the `collections` module.
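
A minimal sketch of the `defaultdict` variant, using a shortened, hypothetical sample in the same `[id, timestamp, members]` shape as `node`:

```python
from collections import defaultdict

# Shortened sample in the same shape as `node` (hypothetical data).
node = [
    ['1003', '2008-01-06T02:13:00Z', ['Lion', 'Panda']],
    ['1004', '2008-01-06T02:15:20Z', ['Lion', 'Eagle']],
]

item_id_dict = defaultdict(list)  # a missing key starts out as an empty list
for sublist in node:
    for item in sublist[2]:
        item_id_dict[item].append(sublist[0])

print(dict(item_id_dict))
# {'Lion': ['1003', '1004'], 'Panda': ['1003'], 'Eagle': ['1004']}
```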

You can then get the minimum id for each item this way:

first_occurence = {
    item: min(item_id_dict[item])
    for item in item_id_dict
}

This will be a dictionary with each word of interest as its key, and the ID of the first occurrence of that word being its value. If you really need it in a list of lists (which I wouldn't recommend, as it's not an intuitive data structure for this problem), you can simply do:

output = []
for item, ident in first_occurence.items():
    # Put the ID first to match the [ID, member] layout asked for in the question.
    output.append([ident, item])
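
One caveat about the `min` call above: the IDs are strings, so `min` compares them character by character, which matches numeric order only when all IDs have the same number of digits (as they do in the question's data). A sketch, assuming the IDs are always digit strings, that passes `key=int` instead; the sample `item_id_dict` below is hypothetical:

```python
# Hypothetical IDs of different lengths.
item_id_dict = {'Lion': ['1003', '999'], 'Panda': ['1003']}

# Plain min() compares strings lexicographically: '1003' < '999'.
by_string = {item: min(ids) for item, ids in item_id_dict.items()}

# key=int compares the numeric values instead: 999 < 1003.
by_number = {item: min(ids, key=int) for item, ids in item_id_dict.items()}

print(by_string)  # {'Lion': '1003', 'Panda': '1003'}
print(by_number)  # {'Lion': '999', 'Panda': '1003'}
```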
