Select the most different elements in an array

Question

I have a file text with P random entries in Binary (or Hex) for processing, from that P number, I have to take N entries such that they are the most different possible between them so i have a good representative of the possible population.

So far, I have think of do a comparison between the current N, and a average of the array that contains the elements using a modified version of the algorithm in: How do I calculate similarity of two integers?

or having a cumulative score of similarity (the higher the most different) between the next element to be selected and all the elements in the array, and choose the next one, and repeat until have selected the required N

I do not know if there is a better solution to this.

Ex.

[00011111, 00101110, 11111111, 01001010 , 00011000, 10010000, 01110101]

P = 7 N = 3

Result: [00011111, 10010000, 00101110]

Thanks in advance

Are you sure you need to do that? For what will you be using these items? — Jim Mischel, May 17 '16 at 12:45
The most different pair is pretty clear, but what exactly do you want to happen for N>2? Maximize the sum of pairwise distances? — harold, May 17 '16 at 13:06

Lks · Answer 1 · 2016-05-18T09:22:58.187

You could calculate the Hamming distances for all combinations if you want to choose the most different binary representation (see https://en.wikipedia.org/wiki/Hamming_distance ).

Edit: small hack

import numpy as np

a = [0b00011111, 0b00101110, 0b11111111, 0b01001010, 0b00011000, 0b10010000, 0b01110101]
N = 3

b = []
for i in a:
    b.append(np.unpackbits(np.uint8(i))) #to binary representation


valuesWithBestOverallDiffs = []

def getItemWithBestOverallDiff(b):
    itemWithBestOverallDiff = [0, 0] #idx, value

    for biter, bval in enumerate(b):
        hammDistSum = 0
        for biter2, bval2 in enumerate(b):
            if biter == biter2:
                continue
            print("iter1: " + str(biter) + " = " + str(bval))
            print("iter2: " + str(biter2) + " = " + str(bval2))
            hammDist = len(np.bitwise_xor(bval, bval2).nonzero()[0])
            print(" => " + str(hammDist))
            hammDistSum = hammDistSum + hammDist

        if hammDistSum > itemWithBestOverallDiff[1]:
            itemWithBestOverallDiff = [biter, hammDistSum]

    #print(itemWithBestOverallDiff)
    return itemWithBestOverallDiff


for i in range(N):
    itemWithBestOverallDiff = getItemWithBestOverallDiff(b)
    print("adding item nr " + str(itemWithBestOverallDiff[0]) + " with value 0b" + str(b[itemWithBestOverallDiff[0]]) + " = " + str(a[itemWithBestOverallDiff[0]]))
    val = a.pop(itemWithBestOverallDiff[0])
    b.pop(itemWithBestOverallDiff[0])
    valuesWithBestOverallDiffs.append(val)

print("result: ")
print(valuesWithBestOverallDiffs)

The final output is result: [144, 117, 255]

which is 0b10010000, 0b01110101, 0b11111111

You mean, calculate the Hamming distance of every element against each other, take it as a sum, then select the N highest? — Colanta, May 17 '16 at 21:38

score 0 · Answer 2 · answered May 17 '16 at 12:57

0

You should compare them Pairwise. this comparison problem is Shortest common supersequence problem (see this). a shortest common supersequence of strings x and y is a shortest string z such that both x and y are subsequences of z. The shortest common supersequence is a problem closely related to the longest common subsequence (see enter link description here). Best solution for the longest common subsequence is dynamic programming method.

answered May 17 '16 at 12:57

Ali Soltani

9,589
5
30
55

@Lks please upvote and accept this answer if this answer helps you, so people know this is the correct answer. – Ali Soltani Jul 06 '16 at 18:48

Select the most different elements in an array

2 Answers2