0

I have a file text with P random entries in Binary (or Hex) for processing, from that P number, I have to take N entries such that they are the most different possible between them so i have a good representative of the possible population.

So far, I have think of do a comparison between the current N, and a average of the array that contains the elements using a modified version of the algorithm in: How do I calculate similarity of two integers?

or having a cumulative score of similarity (the higher the most different) between the next element to be selected and all the elements in the array, and choose the next one, and repeat until have selected the required N

I do not know if there is a better solution to this.

Ex.

[00011111, 00101110, 11111111, 01001010 , 00011000, 10010000, 01110101]

P = 7 N = 3

Result: [00011111, 10010000, 00101110]

Thanks in advance

Community
  • 1
  • 1
Colanta
  • 87
  • 1
  • 6

2 Answers2

0

You could calculate the Hamming distances for all combinations if you want to choose the most different binary representation (see https://en.wikipedia.org/wiki/Hamming_distance ).

Edit: small hack

import numpy as np

a = [0b00011111, 0b00101110, 0b11111111, 0b01001010, 0b00011000, 0b10010000, 0b01110101]
N = 3

b = []
for i in a:
    b.append(np.unpackbits(np.uint8(i))) #to binary representation


valuesWithBestOverallDiffs = []

def getItemWithBestOverallDiff(b):
    itemWithBestOverallDiff = [0, 0] #idx, value

    for biter, bval in enumerate(b):
        hammDistSum = 0
        for biter2, bval2 in enumerate(b):
            if biter == biter2:
                continue
            print("iter1: " + str(biter) + " = " + str(bval))
            print("iter2: " + str(biter2) + " = " + str(bval2))
            hammDist = len(np.bitwise_xor(bval, bval2).nonzero()[0])
            print(" => " + str(hammDist))
            hammDistSum = hammDistSum + hammDist

        if hammDistSum > itemWithBestOverallDiff[1]:
            itemWithBestOverallDiff = [biter, hammDistSum]

    #print(itemWithBestOverallDiff)
    return itemWithBestOverallDiff


for i in range(N):
    itemWithBestOverallDiff = getItemWithBestOverallDiff(b)
    print("adding item nr " + str(itemWithBestOverallDiff[0]) + " with value 0b" + str(b[itemWithBestOverallDiff[0]]) + " = " + str(a[itemWithBestOverallDiff[0]]))
    val = a.pop(itemWithBestOverallDiff[0])
    b.pop(itemWithBestOverallDiff[0])
    valuesWithBestOverallDiffs.append(val)

print("result: ")
print(valuesWithBestOverallDiffs)

The final output is result: [144, 117, 255]

which is 0b10010000, 0b01110101, 0b11111111

Lks
  • 126
  • 5
  • You mean, calculate the Hamming distance of every element against each other, take it as a sum, then select the N highest? – Colanta May 17 '16 at 21:38
0

You should compare them Pairwise. this comparison problem is Shortest common supersequence problem (see this). a shortest common supersequence of strings x and y is a shortest string z such that both x and y are subsequences of z. The shortest common supersequence is a problem closely related to the longest common subsequence (see enter link description here). Best solution for the longest common subsequence is dynamic programming method.

Ali Soltani
  • 9,589
  • 5
  • 30
  • 55