
The following is the code I've written to implement Flajolet and Martin's algorithm. I've used the Jenkins hash function to generate a 32-bit hash value of the data. The program seems to follow the algorithm but is off the mark by about 20%: my data set consists of more than 200,000 unique records, whereas the program estimates about 160,000 unique records. Please help me understand the mistake(s) I'm making. The hash function is implemented as described on Bob Jenkins' website.

import numpy as np
from jenkinshash import jhash

MAGIC_CONST = 0.77351    # Flajolet-Martin correction factor (phi)

class PCSA():
    def __init__(self, nmap, maxlength):
        self.nmap = nmap
        self.maxlength = maxlength
        self.bitmap = np.zeros((nmap, maxlength), dtype=int)  # np.int is deprecated/removed in recent NumPy

    def count(self, data):
        hashedValue = jhash(data)
        indexAlpha = hashedValue % self.nmap
        ix = hashedValue // self.nmap    # floor division so ix stays an integer
        ix = bin(ix)[2:][::-1]
        indexBeta = ix.find("1")    # index of least-significant set bit (-1 if ix == 0)
        if indexBeta >= 0:
            self.bitmap[indexAlpha, indexBeta] = 1


    def getCardinality(self):
        sumIx = 0
        for row in range(self.nmap):
            sumIx += np.where(self.bitmap[row, :] == 0)[0][0]

        A = sumIx / self.nmap

        cardinality = self.nmap * (2 ** A)/ MAGIC_CONST

        return cardinality
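
As an aside, the `ix.find("1")` scan in `count()` computes the position of the least-significant set bit (with -1 when `ix` is 0). The same quantity can be computed without the string round-trip using a standard bit trick; this is just an illustrative sketch, not part of the original program:

```python
def rho(x):
    # (x & -x) isolates the lowest set bit; bit_length - 1 gives its index.
    # Returns -1 for x == 0, matching ix.find("1") on an all-zero string.
    return (x & -x).bit_length() - 1

# Examples: 12 = 0b1100, so its lowest set bit is at index 2.
print(rho(12))  # 2
print(rho(1))   # 0
print(rho(0))   # -1
```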
    Why don't you use the more recent version of this algorithm, HyperLogLog? There is an [example here](https://github.com/juanplopes/sketches/blob/master/hyperloglog.py). – Juan Lopes Jan 22 '15 at 20:39

1 Answer


If you are running this in Python 2, then the division used to calculate A is integer division, so A is silently truncated to an int.

If this is the case, you could try changing:

A = sumIx / self.nmap

to

A = float(sumIx) / self.nmap
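
To illustrate why the truncation matters, here is a small Python 3 sketch (using `//` to reproduce what Python 2's `/` does on ints) with made-up register values:

```python
# Hypothetical values: sum of leftmost-zero indices over 8 bitmaps.
sumIx, nmap = 45, 8

a_floor = sumIx // nmap   # 5      -- what Python 2 integer "/" computes
a_true = sumIx / nmap     # 5.625  -- the intended average

# Since A sits in an exponent, truncating it skews the estimate a lot.
est_floor = nmap * (2 ** a_floor)   # 256
est_true = nmap * (2 ** a_true)     # ~394.8
print(est_floor, est_true)
```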
Peter de Rivaz
  • Thank you Peter. Casting sumIx into float did give a precise output. – HumptyDumptyEIZ Jan 22 '15 at 20:10
  • There is still one problem: the number of distinct values estimated by my program comes out to be more than the actual number of unique values for almost every data set I'm using. I think this shouldn't be so. I would appreciate any help in resolving this bug. – HumptyDumptyEIZ Jan 23 '15 at 02:20