3

Note: Lots of Possible duplicates, but nothing seems to be solving my problem.

I am working on a Plagiarism detection based on MOSS.

After successfully implementing a filter which strips out all the necessary details(comments,punctuations etc) I hash the content using a Rolling Hash Implementation(Rabin Karp)

However the Hashes which match in two text-files of source code, have very different underlying text(No plagiarism and yet same hashes)

The Algorithm I implemented(Ruby) --> (Partial Snippet)

 #Preprocessing from RobinKarp Algorithm
  for c in 0...k do
    text_hash=(radix*text_hash+text_to_process[c].ord)%q
  end

  #Main loop
  for c in 0...loop do   
        text_hash=((radix*text_hash-(text_to_process[c].ord)*highorder)+(text_hash[c+k].ord))%q    

Is there an issue with my implementation? Or the Parameters I specify can be at fault?

I take radix=34 ( I am not sure if it is the right value, I am assuming the stripped out text will only contain alphabets+some special charcters like '+','-','*','/' so a rough estimate of total 34 characters)

I am taking q(prime) to be 101

Is this a collision issue I am dealing with? Any pointers as to how to tackle the problem?

templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
Nitish Upreti
  • 6,312
  • 9
  • 50
  • 92

1 Answers1

2

I note that with q = 101, there are only 101 possible hash values - 0, 1, 2...100. Have you tried increasing q? Another approach would be to look and see if the hash values look like they are randomly chosen values within the possible values of 0,1..q-1.

You should of course also test your program on cases where there are repeated strings for it to find - a failure there could be another symptom of any problem that is also causing collisions, and it would be easier to find and debug.

mcdowella
  • 19,301
  • 2
  • 19
  • 25
  • Increasing q to 1001 is definitely restricting the matches further but still there bogus matches with underlying text different. – Nitish Upreti Jan 15 '12 at 08:52
  • And what do you think should be the optimum value for radix? – Nitish Upreti Jan 15 '12 at 08:59
  • I would make the radix one larger than the maximum character value you expect. Then the text hash will be uniquely determined by the character sequence it hashes, except for the fact that it is computed mod q. Note that you can expect a few collisions even for very large values of q - this is an example of http://en.wikipedia.org/wiki/Birthday_problem. – mcdowella Jan 15 '12 at 12:50