-7

I have a large number of pieces of text which I need to compare between themselves to check whether or not they are similar. Each piece is about 10000 words long.
Hence I'll pre-calculate a hash of each one and compare hashes.

The question is, which hash algorithm would be better for that? md5? sha1? sha256? Or perhaps base64? Or maybe it doesn't even really matter?

I'm aware that even single whitespace can change the value of a hash, that's ok with me.

Rahul
  • 18,271
  • 7
  • 41
  • 60
Jodooomi
  • 369
  • 5
  • 12
  • What have you tried? A search for "text similarity" on google provides quite a lot of results... – Laur Ivan Jan 18 '17 at 09:19
  • Hashing will only speed the process of finding *exact duplicates*. As you say, any minor change will produce a different hash code. Are you trying to find exact duplicates? If you're looking for similarity (rearranged paragraphs, a few changed words, slightly varying sentence structure, etc.), you'll need a much more involved algorithm. – Jim Mischel Jan 18 '17 at 15:37
  • @JimMischel, learn to read. – Jodooomi Jan 18 '17 at 16:57
  • Rudeness is not required. *Your requirements are unclear.* You said that you want to "check whether or not they similar." My point is that "similar" is not necessarily "exact duplicate." I did acknowledge your comment about being okay with whitespace changing the value of a hash. You need to clarify whether you're looking for similarity (i.e. this text is somewhat like that other text) or *exact duplicate*. – Jim Mischel Jan 18 '17 at 17:45
  • @JimMischel, required. – Jodooomi Jan 19 '17 at 03:24

2 Answers2

0

Use zlib.crc32 then do textual compare of texts with matching hashes to make sure.

Paddy3118
  • 4,704
  • 27
  • 38
  • The above would be a check for equality - including spaces, punctuation etc. I chose it because I have used it to compare millions of sequences of 4K bytes before to see if I chanced on even one case of a single bit flip not being caught by the CRC - After getting a statistical value of how unlikely it would be of course. – Paddy3118 Jan 18 '17 at 09:39
  • 1
    yes, how do you know it's better than base64, md5, sha1, etc, for this task? – Jodooomi Jan 18 '17 at 09:46
  • I don't know it is better. I think it will suffice however. – Paddy3118 Jan 18 '17 at 11:42
0

When does hashing work?

What hashing does is reduce search space so that equivalent items can be found more quickly. It works whenever there is a reliable way to produce a single canonical value for all members of an equivalence class.

Selecting a unique value among equivalent strings

Before hashing, the strings need to be converted to a canonical value (one unique representation among all equivalent strings).

I'm aware that even a single whitespace can change the value of a hash, that's ok with me.

For your application, here is possible canonicalizing function that just removes whitespace:

>>> def canonical(s):
        return ''.join([c for c in s if not c.isspace()])

>>> s = 'the   quick\nbrown\tfox jumped'
>>> t = '  the\tquick   brown  fox  jumped'
>>> canonical(s)
'thequickbrownfoxjumped'
>>> canonical(t)
'thequickbrownfoxjumped'

Applying a hash function

A sha256() is fast and has almost no chance of a false positive.

In Python 2, you can compute the sha256 directly from a string. However, in Python 3, the string must first be encoded into bytes:

>>> from hashlib import sha256
>>> sha256(canonical(s).encode()).hexdigest()
'2c31c202821431b015cb800ab6315289884e87f1ed023abc876915685c620919'
>>> sha256(canonical(t).encode()).hexdigest()
'2c31c202821431b015cb800ab6315289884e87f1ed023abc876915685c620919'

When won't hashing work?

If you just want to group by text similarity, hashing doesn't work as well because there isn't a straight-forward way to choose a representative element and because similarity is isn't a transitive relation (a is close to b and b is close to c doesn't imply that a is close to c).

Raymond Hettinger
  • 216,523
  • 63
  • 388
  • 485