
Given a list of lists

lol = [[0,'a'], [0,'b'],
       [1,'b'], [1,'c'],
       [2,'d'], [2,'e'],
       [2,'g'], [2,'b'],
       [3,'e'], [3,'f']]

I would like to extract all sublists that have the same last element (lol[n][1]) and end up with something like below:

[0,'b']
[1,'b']
[2,'b']
[2,'e']
[3,'e']

I know that given two lists we can use an intersection. What is the right way to go about a problem like this, other than incrementing the index value in a for-each loop?

Bob R
  • Go through each sub-list inside the list: extract the last element of the sub-list (`a`, `b` in your example), and append the sub-list to a dictionary of lists keyed by the extracted element. E.g. `{'a': [[0, 'a']], 'b': [[0, 'b'], [1, 'b']]}`. – yoonghm Nov 24 '21 at 01:03
  • I have added multiple ways you can do this in a pythonic way. Do check them out. – Akshay Sehgal Nov 24 '21 at 01:29

1 Answer


1. Using collections.defaultdict

You can use defaultdict to first group your items by their last element, then iterate over d.items() and keep only the groups with more than one occurrence.

from collections import defaultdict


lol = [[0,'a'], [0,'b'],
       [1,'b'], [1,'c'],
       [2,'d'], [2,'e'],
       [2,'g'], [2,'b'],
       [3,'e'], [3,'f']]


d = defaultdict(list)

for v,k in lol:
    d[k].append(v)

# d looks like - 
# defaultdict(list,
#             {'a': [0],
#              'b': [0, 1, 2],
#              'c': [1],
#              'd': [2],
#              'e': [2, 3],
#              'g': [2],
#              'f': [3]})
    
result = [[v,k] for k,vs in d.items() for v in vs if len(vs)>1]
print(result)
[[0, 'b'], [1, 'b'], [2, 'b'], [2, 'e'], [3, 'e']]
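If you want to keep the pairs in their original input order (the defaultdict version groups them by key), a two-pass variant with collections.Counter works as a sketch of the same idea:

```python
from collections import Counter

lol = [[0,'a'], [0,'b'], [1,'b'], [1,'c'], [2,'d'],
       [2,'e'], [2,'g'], [2,'b'], [3,'e'], [3,'f']]

# count how often each key (the second element) appears
counts = Counter(k for _, k in lol)

# keep every pair whose key occurs more than once; preserves input order
result = [pair for pair in lol if counts[pair[1]] > 1]
print(result)  # [[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]
```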

2. Using pandas.duplicated

Here is how you can do this with Pandas -

  1. Convert to pandas dataframe
  2. For key column, find the duplicates and keep all of them
  3. Convert to list of records while ignoring index
import pandas as pd

df = pd.DataFrame(lol, columns=['val','key'])
dups = df[df['key'].duplicated(keep=False)]
result = list(dups.to_records(index=False))
print(result)
[(0, 'b'), (1, 'b'), (2, 'e'), (2, 'b'), (3, 'e')]
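Note that to_records gives numpy record tuples; if you want plain list-of-lists output matching the input shape, one option (a minimal sketch of the same pandas approach) is values.tolist():

```python
import pandas as pd

lol = [[0,'a'], [0,'b'], [1,'b'], [1,'c'], [2,'d'],
       [2,'e'], [2,'g'], [2,'b'], [3,'e'], [3,'f']]

df = pd.DataFrame(lol, columns=['val', 'key'])
dups = df[df['key'].duplicated(keep=False)]

# values.tolist() returns plain Python lists instead of record tuples
result = dups.values.tolist()
print(result)  # [[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]
```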

3. Using numpy.unique

You can solve this in a vectorized manner using numpy -

  1. Convert to numpy matrix arr
  2. Find unique elements u and their counts c
  3. Filter list of unique elements that occur more than once dup
  4. Use broadcasting to compare the second column of the array and take any over axis=0 to get a boolean which is True for duplicated rows
  5. Filter the arr based on this boolean
import numpy as np

arr = np.array(lol)

u, c = np.unique(arr[:,1], return_counts=True)
dup = u[c > 1]

result = arr[(arr[:,1]==dup[:,None]).any(0)]
result
array([['0', 'b'],
       ['1', 'b'],
       ['2', 'e'],
       ['2', 'b'],
       ['3', 'e']], dtype='<U21')
Akshay Sehgal
  • Do you mind explaining this line? What do you call this type of linear nesting? I assume that `k,vs in d.items()` would be the outermost for loop if written as nested loops in a more traditional way? `result = [[v,k] for k,vs in d.items() for v in vs if len(vs)>1]` – Bob R Nov 24 '21 at 03:24
  • Sure, check for `nested loops in list comprehension`. The general structure is `[item for sublist in list for item in sublist]` rather than `[item for item in sublist for sublist in list]`. – Akshay Sehgal Nov 24 '21 at 08:37
  • So I am iterating over the items in the grouped dictionary, and for each element in the values of the dict (`vs`, which is a list), I am again iterating over the sublist to get `v`, with the condition `len(vs)>1` for duplicates. Then I am simply coupling it with the corresponding `k` and returning it as a list of lists. – Akshay Sehgal Nov 24 '21 at 08:40
  • Hope that clarifies what you asked. Let me know if any confusion. Also, do feel free to mark the answer if it helped solve your query. Thanks! – Akshay Sehgal Nov 24 '21 at 08:40
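The nested comprehension discussed in the comments above can be unrolled into explicit loops to make the ordering clear (outermost clause first, same d as in the answer):

```python
from collections import defaultdict

lol = [[0,'a'], [0,'b'], [1,'b'], [1,'c'], [2,'d'],
       [2,'e'], [2,'g'], [2,'b'], [3,'e'], [3,'f']]

d = defaultdict(list)
for v, k in lol:
    d[k].append(v)

# the one-line comprehension...
result = [[v, k] for k, vs in d.items() for v in vs if len(vs) > 1]

# ...is equivalent to this nested form
expanded = []
for k, vs in d.items():    # outermost clause becomes the outer loop
    for v in vs:           # second clause becomes the inner loop
        if len(vs) > 1:    # trailing condition becomes the filter
            expanded.append([v, k])

print(expanded)  # [[0, 'b'], [1, 'b'], [2, 'b'], [2, 'e'], [3, 'e']]
```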