7

Does anyone knows a simple perceptual hash algorithm for text ? I took a look in the pHash function ph_texthash but I want a more simple algorithm. Preferably in Python. Thank you !

Tarantula
  • 19,031
  • 12
  • 54
  • 71
  • 1
    http://stackoverflow.com/questions/3216901/perceptual-hash-algorithms-in-python-or-php <- Perhaps can help you? –  Jun 30 '11 at 18:54
  • 1
    Though, that question had no definitive answer, there was a bit of a discussion that might be of assistance. –  Jun 30 '11 at 18:55
  • I think you are using the wrong terms, that's why you did not come up with anything. For text you'd usually use a vector of n-grams either of words of characters (this will give you something like edit/levenshtein distance). If you want it more "perceptual", I'd have a look at latent semantic indexing or latent dirichlet allocation. The academic field that deals with this is is Information Retrieval. – Maarten Nov 13 '13 at 03:12
  • What kind of similarities are you looking for in texts? – Ziyuan Dec 28 '18 at 23:03
  • This is a question from almost 9 years ago. – Tarantula Dec 30 '18 at 16:27

1 Answers1

3

A blog post about perceptual hash functions (in the imaging context):

and some related python code (dealing with images, not text, but may be adaptable):


As I understand this short presentation about Perceptual Hashing of Textual Content, there are numerous approaches (in different dimensions such as the level of the text, linguistic or statistical approach, the model chosen to represent the text, ...), and the right one will depend on your domain and the problems you try to solve.

Also you might look into Locality-sensitive hashing, which

is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items)

miku
  • 181,842
  • 47
  • 306
  • 310
  • I've already read all links hehe, but I think that I'll end up using some linguistic approach as you cited, thanks for the answer. – Tarantula Jun 30 '11 at 23:28
  • Ok, sorry, so nothing new for you. I'm curious, how you'll solve the problem - any chance that you can share your algorithm or code? – miku Jul 01 '11 at 00:59
  • 1
    My problem is that the text I'll need to hash has a specific structure, so I'll need to think in an algorithm which would be adapted to changes in this structure, it has text and real values together, it will take time, but later I'll try to post it here. – Tarantula Jul 01 '11 at 12:00