1. Using collections.defaultdict
You can use defaultdict
to first group your items by key, then iterate over d.items()
and keep only the keys that occur more than once.
from collections import defaultdict

lol = [[0,'a'], [0,'b'],
       [1,'b'], [1,'c'],
       [2,'d'], [2,'e'],
       [2,'g'], [2,'b'],
       [3,'e'], [3,'f']]

d = defaultdict(list)
for v, k in lol:
    d[k].append(v)

# d looks like -
# defaultdict(list,
#             {'a': [0],
#              'b': [0, 1, 2],
#              'c': [1],
#              'd': [2],
#              'e': [2, 3],
#              'g': [2],
#              'f': [3]})
result = [[v,k] for k,vs in d.items() for v in vs if len(vs)>1]
print(result)
[[0, 'b'], [1, 'b'], [2, 'b'], [2, 'e'], [3, 'e']]
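Note that grouping reorders the output by key. If you want to keep the pairs in their original order, a small variant (a sketch using collections.Counter instead of defaultdict) counts the keys first and then filters the original list:

```python
from collections import Counter

lol = [[0, 'a'], [0, 'b'],
       [1, 'b'], [1, 'c'],
       [2, 'd'], [2, 'e'],
       [2, 'g'], [2, 'b'],
       [3, 'e'], [3, 'f']]

# Count how often each key appears across all pairs.
counts = Counter(k for _, k in lol)

# Keep only the pairs whose key repeats, preserving input order.
result = [[v, k] for v, k in lol if counts[k] > 1]
print(result)  # [[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]
```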
2. Using pandas.duplicated
Here is how you can do this with Pandas -
- Convert the list to a pandas dataframe
- For the key column, find the duplicates and keep all of them (keep=False)
- Convert back to a list of records, ignoring the index
import pandas as pd
df = pd.DataFrame(lol, columns=['val','key'])
dups = df[df['key'].duplicated(keep=False)]
result = list(dups.to_records(index=False))
print(result)
[(0, 'b'), (1, 'b'), (2, 'e'), (2, 'b'), (3, 'e')]
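to_records gives numpy record tuples rather than lists. If you need plain lists of lists back, one way (a sketch, assuming the same dataframe as above) is to go through .values.tolist():

```python
import pandas as pd

lol = [[0, 'a'], [0, 'b'],
       [1, 'b'], [1, 'c'],
       [2, 'd'], [2, 'e'],
       [2, 'g'], [2, 'b'],
       [3, 'e'], [3, 'f']]

df = pd.DataFrame(lol, columns=['val', 'key'])

# Rows whose key appears more than once; keep=False marks every copy.
dups = df[df['key'].duplicated(keep=False)]

# .values.tolist() returns plain Python lists instead of record tuples.
result = dups.values.tolist()
print(result)  # [[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]
```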
3. Using numpy.unique
You can solve this in a vectorized manner using numpy -
- Convert the list to a numpy array arr
- Find the unique elements u of the key column and their counts c
- Filter the unique elements down to those that occur more than once (dup)
- Use broadcasting to compare the second column of arr against dup, and take any over axis=0 to get a boolean mask that is True for duplicated rows
- Filter arr based on this mask
import numpy as np
arr = np.array(lol)
u, c = np.unique(arr[:,1], return_counts=True)
dup = u[c > 1]
result = arr[(arr[:,1]==dup[:,None]).any(0)]
result
array([['0', 'b'],
['1', 'b'],
['2', 'e'],
['2', 'b'],
['3', 'e']], dtype='<U21')
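Because np.array(lol) has mixed int and str entries, everything is upcast to strings (hence the dtype='&lt;U21' output). If you need the original ints back, one option (a sketch reusing the boolean mask from above) is to index the original Python list instead of the array:

```python
import numpy as np

lol = [[0, 'a'], [0, 'b'],
       [1, 'b'], [1, 'c'],
       [2, 'd'], [2, 'e'],
       [2, 'g'], [2, 'b'],
       [3, 'e'], [3, 'f']]

arr = np.array(lol)                               # all values become strings
u, c = np.unique(arr[:, 1], return_counts=True)   # unique keys and counts
dup = u[c > 1]                                    # keys occurring more than once
mask = (arr[:, 1] == dup[:, None]).any(0)         # True for duplicated rows

# Apply the mask to the original list so the ints stay ints.
result = [row for row, keep in zip(lol, mask) if keep]
print(result)  # [[0, 'b'], [1, 'b'], [2, 'e'], [2, 'b'], [3, 'e']]
```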