4

I am trying to solve a problem that is a part of my genome alignment project. The problem goes as follows: if given a nested list

y = [[1,2,3],[1,2,3],[3,4,5],[6,5,4],[4,2,5],[4,2,5],[1,2,8],[1,2,3]]

extract indices of unique lists into a nested list again.

For example, the output for the above nested list should be

[[0,1,7],[2],[3],[4,5],[6]].

This is because list [1,2,3] is present in 0,1,7th index positions, [3,4,5] in 2nd index position and so on.

Since I will be dealing with large lists, what could be the most optimal way of achieving this in Python?

Georgy
  • 12,464
  • 7
  • 65
  • 73
sindhuja
  • 167
  • 1
  • 12

4 Answers4

7

You could create an dictionary (or OrderedDict if on older pythons). The keys of the dict will be tuples of the sub-lists and the values will be an array of indexes. After looping through, the dictionary values will hold your answer:

from collections import OrderedDict

y = [[1,2,3],[1,2,3],[3,4,5],[6,5,4],[4,2,5],[4,2,5],[1,2,8],[1,2,3]]

lookup = OrderedDict()
for idx,l in enumerate(y):
    lookup.setdefault(tuple(l), []).append(idx)

list(lookup.values())
# [[0, 1, 7], [2], [3], [4, 5], [6]]
Mark
  • 90,562
  • 7
  • 108
  • 148
  • _older pythons_ means: In Python 3.6 the order-preserving aspect is working and in Python 3.7 or higher it is guaranteed, see e.g. [this post](https://stackoverflow.com/q/39980323/2648551) for details. – colidyre Jan 21 '20 at 22:37
  • 1
    This solution is much quicker and elegant then mine, runs much quicker. +1 – PacketLoss Jan 21 '20 at 22:41
  • 1
    This method is very efficient even for large lists. Thank you! – sindhuja Jan 23 '20 at 19:25
3

You could use list comprehension and range to check for duplicate indexes and append them to result.

result = []
for num in range(len(y)):
    occurances = [i for i, x in enumerate(y) if x == y[num]]
    if occurances not in result: result.append(occurances)

result
#[[0, 1, 7], [2], [3], [4, 5], [6]]
PacketLoss
  • 5,561
  • 1
  • 9
  • 27
2

Consider numpy to solve this:

import numpy as np

y = [
    [1, 2, 3],
    [1, 2, 3],
    [3, 4, 5],
    [6, 5, 4],
    [4, 2, 5],
    [4, 2, 5],
    [1, 2, 8],
    [1, 2, 3]
]

# Returns unique values of array, indices of that
# array, and the indices that would rebuild the original array
unique, indices, inverse = np.unique(y, axis=0, return_index=True, return_inverse=True)

Here's a print out of each variable:

unique = [
[1 2 3]
[1 2 8]
[3 4 5]
[4 2 5]
[6 5 4]]

indices = [0 6 2 4 3]

inverse = [0 0 2 4 3 3 1 0]

If we look at our variable - inverse, we can see that we do indeed get [0, 1, 7] as the index positions for our first unique element [1,2,3], all we need to do now is group them appropriately.

new_list = []
for i in np.argsort(indices):
    new_list.append(np.where(inverse == i)[0].tolist()) 

Output:

new_list = [[0, 1, 7], [2], [3], [4, 5], [6]]

Finally, refs for the code above:

Numpy - unique, where, argsort

Flavio
  • 66
  • 3
2

One more solution:

y = [[1, 2, 3], [1, 2, 3], [3, 4, 5], [6, 5, 4], [4, 2, 5], [4, 2, 5], [1, 2, 8], [1, 2, 3]]

occurrences = {}

for i, v in enumerate(y):
    v = tuple(v)
    if v not in occurrences:
        occurrences.update({v: []})
    occurrences[v].append(i)

print(occurrences.values())
funnydman
  • 9,083
  • 4
  • 40
  • 55