
I have a large collection (n = 20,000,000) of BigInteger values, each representing a bit array of length 225. Given a single BigInteger, I want to find all BigIntegers in my collection below a certain Hamming distance from it.

Currently, I convert all BigIntegers to byte arrays:

// bHashes is a field (byte[][]) used later by GetMatches()
bHashes = new byte[hashes.Length][];
for (int i = 0; i < hashes.Length; i++)
{
    bHashes[i] = hashes[i].ToByteArray();
}
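
Note that BigInteger.ToByteArray() trims leading zero bytes, so the returned arrays can vary in length. Something like the following (hypothetical) helper could pad them to a fixed width so that fixed-index lookups never go out of range; the width and helper name are illustrative, not part of my code:

using System;
using System.Numerics;

// Hypothetical helper: pad ToByteArray()'s little-endian output to a
// fixed width (assumes non-negative values, so padding bytes are zero).
static byte[] ToFixedWidth(BigInteger value, int width)
{
    byte[] raw = value.ToByteArray();  // little-endian, variable length
    if (raw.Length == width) return raw;
    byte[] padded = new byte[width];   // zero-initialized
    Array.Copy(raw, padded, Math.Min(raw.Length, width));
    return padded;
}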

I then create a Hamming distance lookup array:

int[][] lookup = new int[256][];

for (int i = 0; i < 256; i++)
{
    lookup[i] = new int[256];
    for (int j = 0; j < 256; j++)
    {
        lookup[i][j] = HammingDistance(i, j);
    }
}
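
Since HammingDistance(i, j) depends only on i ^ j, a one-dimensional 256-entry table indexed by the XOR of the two bytes would hold the same information in less memory (a sketch; whether it is actually faster would need measuring):

// 1D alternative: popCount[x] = number of set bits in x
int[] popCount = new int[256];
for (int x = 0; x < 256; x++)
{
    popCount[x] = HammingDistance(x, 0);
}
// popCount[a[0] ^ b[0]] is equivalent to lookup[a[0]][b[0]]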

static int HammingDistance(BigInteger a, BigInteger b)
{
    BigInteger n = a ^ b;

    // Kernighan's bit count: each iteration clears the lowest set bit.
    int x = 0;
    while (n != 0)
    {
        n &= (n - 1);
        x++;
    }
    return x;
}

Finally, I calculate the total Hamming distance as the sum of the per-byte Hamming distances. My timing measurements have shown that "manually" adding the distances is faster than using a loop:

static List<int> GetMatches(byte[] a)
{
    List<int> result = new List<int>();
    for (int i = 0; i < bHashes.Length; i++)
    {
        byte[] b = bHashes[i];
        int dist = lookup[a[0]][b[0]] +
                   lookup[a[1]][b[1]] +
                   lookup[a[2]][b[2]] +
                   lookup[a[3]][b[3]] +
                   lookup[a[4]][b[4]] +
                   lookup[a[5]][b[5]] +
                   lookup[a[6]][b[6]] +
                   lookup[a[7]][b[7]] +
                   lookup[a[8]][b[8]] +
                   lookup[a[9]][b[9]] +
                   lookup[a[10]][b[10]] +
                   lookup[a[11]][b[11]] +
                   lookup[a[12]][b[12]] +
                   lookup[a[13]][b[13]] +
                   lookup[a[14]][b[14]];
        if (dist < THRESHOLD) result.Add(i);
    }
    return result;
}

Preprocessing time is irrelevant; only the execution time of the GetMatches() function matters. Using the method above, my system needs ~1.2 s which, unfortunately, is way too long for my needs. Is there a faster way?

aseipel
  • You could try a larger lookup array, to get the distance between two 16-bit blocks at once, for example. It's not necessarily faster, but worth a try (see the sketch after the comments). – harold Nov 18 '16 at 11:40
  • The for loop in GetMatches can be parallelized; see https://msdn.microsoft.com/en-us/library/dd460713(v=vs.110).aspx for a simple example of parallelizing the for construct (see the sketch after the comments). – Thomas K Nov 18 '16 at 12:04
  • Why do you need a BigInteger? What you pass into that function will never be larger than 256. – Matt Johnson Nov 18 '16 at 13:13
  • @ThomasK Thanks, that helped a bit, still not fast enough though – aseipel Nov 18 '16 at 13:37
  • @MattJohnson I don't need BigInteger, the data just happens to be in that form. I don't pass a BigInteger into that function anyway, as I convert them to byte arrays before. – aseipel Nov 18 '16 at 13:40
  • Did you consider rewriting your algorithm for the GPU (e.g. as an OpenCL kernel)? As this algorithm is compute intensive, uses long vectors and is easy to parallelize, it is well suited to the GPU and should give you an order-of-magnitude speed-up. But of course it's not trivial. – Ňuf Nov 18 '16 at 19:43
  • BigInteger seems like the biggest bottleneck, as it's pretty slow in .NET – Slai Nov 18 '16 at 20:54
  • @Slai BigInteger doesn't matter. I only care about the GetMatches() function which only uses byte[] – aseipel Nov 21 '16 at 08:39
  • @Ňuf Sounds great, can you recommend any resources (code examples, literature)? – aseipel Nov 21 '16 at 08:42
  • [CUDAfy.NET](https://cudafy.codeplex.com/) http://stackoverflow.com/questions/375011/utilizing-the-gpu-with-c-sharp – Slai Nov 21 '16 at 10:15
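
Putting together harold's 16-bit-block idea and Thomas K's Parallel.For suggestion, a rough sketch of what that could look like (the table construction, the ushort repacking, and the concurrent result collection are all assumptions, not measured code):

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Popcount table for 16-bit blocks: pop16[x] = number of set bits in x.
static int[] BuildPop16()
{
    int[] pop16 = new int[1 << 16];
    for (int x = 1; x < pop16.Length; x++)
    {
        pop16[x] = pop16[x >> 1] + (x & 1);
    }
    return pop16;
}

// Assumes each hash was repacked into ushort blocks beforehand
// (uHashes), analogous to the byte[][] conversion in the question.
static List<int> GetMatchesParallel(ushort[] a, ushort[][] uHashes,
                                    int[] pop16, int threshold)
{
    ConcurrentBag<int> result = new ConcurrentBag<int>();
    Parallel.For(0, uHashes.Length, i =>
    {
        ushort[] b = uHashes[i];
        int dist = 0;
        for (int k = 0; k < a.Length; k++)
        {
            dist += pop16[a[k] ^ b[k]];
        }
        if (dist < threshold)
        {
            result.Add(i);
        }
    });
    return new List<int>(result);
}

Note that with ConcurrentBag the order of the returned indices is no longer deterministic; sorting afterwards, or collecting per-thread lists, would be needed if order matters.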

0 Answers