
I have a list containing the following 10 words:

[A,B,C,D,E,F,G,H,I,J]

I have a dataset which contains permutations of these words such as:

  • A,B
  • A,B,C,D
  • E,F,G
  • H ... and so on.

Most of the combinations appear only once, but unfortunately some appear several times with the same words in a different order. I want to convert those repeated combinations, such as:

  1. A,B,C,D,E
  2. C,A,B,D,E
  3. D,A,B,C,E and so on. (For 10 elements there would be about 9 million ordered arrangements but only 1023 combinations when order is ignored. My data has about 1700, meaning there are some repetitions.)
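
As a quick sanity check on those counts, here is a minimal sketch. It assumes Python 3.8+ (for `math.perm`), and the variable names are only illustrative:

```python
import math

n = 10  # number of distinct words

# Ordered arrangements of every non-empty subset: sum of n! / (n - k)!
ordered = sum(math.perm(n, k) for k in range(1, n + 1))

# Non-empty subsets when order is ignored: 2**n - 1
unordered = 2**n - 1

print(ordered)    # 9864100
print(unordered)  # 1023
```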

I want to convert all of these into a single unique value. All three rows contain the same words in a different order, so all three should become, say, A,B,C,D,E. The value itself can be anything, as long as it is the same for every row that contains the same words. How can I do this in Python?

I was able to generate the unique combinations with this Python code:

import itertools

stuff = ['A','B','C','D','E','F','G','H','I','J']
for L in range(1, len(stuff)+1):
    for subset in itertools.combinations(stuff, L):
        print(list(subset))

How do I convert those 1700 into 1023 unique values?

Kshitij Yadav

2 Answers


You could use a set of frozensets. Assuming that dataset is a list of lists (or more generally an iterable of iterables), you could do:

resul = set((frozenset(elt) for elt in dataset))

The inner elements have to be frozensets, because a set can only contain hashable elements and a plain (mutable) set is not hashable.
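
A small sketch of why plain sets do not work here (the two sample rows are made up for illustration):

```python
rows = [["A", "B"], ["B", "A"]]

# A plain set is mutable and therefore unhashable, so it cannot be a set member.
try:
    {set(row) for row in rows}
    plain_set_ok = True
except TypeError:
    plain_set_ok = False

# frozenset is immutable and hashable; equal contents compare and hash equal,
# so both rows collapse into a single element.
unique = {frozenset(row) for row in rows}

print(plain_set_ok)  # False
print(len(unique))   # 1
```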

You can convert that back to a list of lists with:

filtered_dataset = [list(elt) for elt in resul]
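
Note that the order of the words inside each list built from a frozenset is arbitrary. If every row should instead be rewritten to one fixed representative, sorting the words is one option. This is a minimal sketch, assuming dataset is a list of lists of strings and that no row repeats a word; the sample data and the names `canonical` and `unique_rows` are hypothetical:

```python
# Hypothetical sample data: the first three rows hold the same words.
dataset = [
    ["A", "B", "C", "D", "E"],
    ["C", "A", "B", "D", "E"],
    ["D", "A", "B", "C", "E"],
    ["A", "B"],
]

# Map every row to a canonical form: its words in sorted order.
canonical = [tuple(sorted(row)) for row in dataset]

# Deduplicate while keeping first-seen order (dicts preserve insertion order).
unique_rows = list(dict.fromkeys(canonical))

print(canonical[1])  # ('A', 'B', 'C', 'D', 'E')
print(unique_rows)   # [('A', 'B', 'C', 'D', 'E'), ('A', 'B')]
```

Tuples are used because they are hashable, so the canonical rows can also serve as dictionary keys or set members.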

Serge Ballesta

It appears that you're looking for the "power set" of your word list. The itertools documentation includes a recipe for building it.

To number the sets, use a binary encoding of the presence or absence of each element. This gives you a direct conversion. For instance, {G, H, J} would map to 0000001101, or ID 13. You can do the conversion either way with a list comprehension, such as

bits = [int(word in subset) for word in word_list]  # subset: the combination being encoded
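
The encoding described above can be sketched in both directions as follows. The helper names `subset_to_id` and `id_to_subset` are hypothetical, and the first word in `word_list` is assumed to be the most significant bit, which matches the {G, H, J} example:

```python
word_list = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]

def subset_to_id(subset, words=word_list):
    """Encode a collection of words as an integer, ignoring their order."""
    members = set(subset)
    # One bit per word; the first word is the most significant bit.
    bits = "".join("1" if w in members else "0" for w in words)
    return int(bits, 2)

def id_to_subset(ident, words=word_list):
    """Decode an integer ID back into the list of words it represents."""
    bits = format(ident, f"0{len(words)}b")  # zero-padded binary string
    return [w for w, b in zip(words, bits) if b == "1"]

print(subset_to_id(["J", "G", "H"]))  # 13
print(id_to_subset(13))               # ['G', 'H', 'J']
```

Because the ID depends only on which words are present, every reordering of the same words gets the same ID.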

Is that enough to move you along?

Prune
  • For the sake of simplicity I made each one a single-letter word. They're actually 3-6 letter words – Kshitij Yadav Mar 23 '20 at 21:51
  • I understand, and used the same notation. I even referred to each list element as a `word`. How does this affect the solution I outlined? – Prune Mar 23 '20 at 21:55