
I would like to make a sort of hash key out of a text (in my case HTML) that would match/compare to the hashes of other similar texts

Example of matching texts:

  • "2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..
  • "2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_2 + ..
  • ...
  • "2012/10/02 This is my webpage #2"+ 100k_of_same_text + random_words_3 + ..

So far I've thought of removing numbers and tags, but that would still leave the random words.

Is there anything out there that does this?

I have root access to the server, so I can add any UDF that is necessary, and if needed I can do the processing in C or other languages.

The ideal would be a function like generateSimilarHash(text) and another function compareSimilarHashes(hash1,hash2) that would return the percentage of matching text.

Any function like compare(text1,text2) would not work in my case, as I have many pages to compare (~20 million at the moment).

Any advice is welcome!

Stefan Rogin

3 Answers


I've never had to do anything quite like this, so just throwing something out there based on general hashing knowledge.

First, in general, I doubt you can represent the entire string you want to compare as one value hashed therefrom, then meaningfully find approximate matches using just that. Hashing functions are generally designed to produce a huge pseudo-random difference in output value from the tiniest change in input value - so used naively they're not a good match for this problem, but...

What might work is using some convention for breaking long text into subsections, such as looking for terminating punctuation (full stop, exclamation mark, question mark) at least N characters apart, then you could hash those individual substrings and use a count of matching hashes to approximate the amount of matching text.

You'd have to work out a suitable level of granularity to divide the text into a reasonable number of distinct hashes - balancing the size of your hashes and the speed of hash comparisons against the accuracy of matches. You may also want to do some prior transforms, such as converting characters to a single case or replacing each run of one or more whitespace characters with a single space, perhaps replacing punctuation with a space: that way trivial differences won't cause hashes to mismatch - tune to taste.
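For illustration, here is a minimal sketch of such a prior transform in C++ (the normalize name and the exact rules are just placeholder choices for this sketch, not a fixed recipe):

#include <cctype>
#include <string>

// Lowercase the text and collapse each run of whitespace or punctuation
// into a single space, so trivial formatting differences don't change
// the resulting hashes.
std::string normalize(const std::string& text){
    std::string out;
    bool in_gap = false;                    // inside a whitespace/punctuation run?
    for (unsigned char c : text){
        if (std::isspace(c) || std::ispunct(c)){
            in_gap = true;                  // emit at most one separator later
        } else {
            if (in_gap && !out.empty()) out.push_back(' ');
            out.push_back(static_cast<char>(std::tolower(c)));
            in_gap = false;
        }
    }
    return out;
}

If you split on punctuation, you'd run something like this on each chunk after splitting rather than on the whole text beforehand.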

In your example:

"2012/10/01 This is my webpage #1"+ 100k_of_same_text + random_words_1 + ..

Say you break on full stops, or (where there are no full stops) at words that are local minima in sorted word order, such that at most 5-20 words appear in a section... you might end up with substrings like:

  • "2012/10/01 This is my webpage #1."
  • "This is the first bit from 100k of text."
  • "This is the second bit from 100k of text."
  • "Yet another bit from the 100k."
  • "chicken book dog crayon stick hug"
    • break due to "apple" being local min
  • "apple twig paper glove bookend ibm"
    • break due to "activation" being local min
  • "activation usurper triad monkey wrench."
    • break on "."
  • "zebra italy quark stew century dinosaur jacket egg trick"
    • break due to "chicken" being local min; "century" is < 5 words in
  • "chicken joke road bad"

Then you use a normal string hashing function on each of the above. To compare this to other similarly-hashed text, you look for the number of matching hash values (if you don't assign some importance to the order or contiguousness of matching subsections of text, it's pretty efficient to iterate over pre-sorted lists of both sets of hashes, or prepopulate hash tables with the hash values then seek each in turn).
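To make that concrete, here is a rough sketch of the whole scheme (the function names echo the generateSimilarHash/compareSimilarHashes interface from the question; the break characters, std::hash and the 20-character minimum are all placeholder choices to tune):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Break text into chunks at terminating punctuation, requiring each
// chunk to be at least min_len characters so we don't emit tiny fragments.
std::vector<std::string> split_chunks(const std::string& text, std::size_t min_len){
    std::vector<std::string> chunks;
    std::string current;
    for (char c : text){
        current.push_back(c);
        if ((c == '.' || c == '!' || c == '?') && current.size() >= min_len){
            chunks.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) chunks.push_back(current);   // trailing remainder
    return chunks;
}

// Hash each chunk with an ordinary string hash and return the sorted
// list of hashes - this is the per-page "similar hash" signature.
std::vector<std::size_t> generateSimilarHash(const std::string& text){
    std::vector<std::size_t> hashes;
    for (const std::string& chunk : split_chunks(text, 20))
        hashes.push_back(std::hash<std::string>{}(chunk));
    std::sort(hashes.begin(), hashes.end());
    return hashes;
}

// Count matching hashes between two sorted signatures and return the
// fraction in common - an approximation of the percentage of matching text.
double compareSimilarHashes(const std::vector<std::size_t>& a,
                            const std::vector<std::size_t>& b){
    std::vector<std::size_t> common;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(common));
    std::size_t larger = std::max(a.size(), b.size());
    return larger ? static_cast<double>(common.size()) / larger : 1.0;
}

Since the signatures are just sorted lists of integers, the 20 million pages only need a hash comparison per candidate pair, never a full text comparison.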

Tony Delroy
  • using fixed-size blocks might be more convenient than splitting up at punctuation. – Sander De Dycker Feb 07 '13 at 09:54
  • @SanderDeDycker Definitely convenient if it would work, but say you have "differentabcdefghijklmnopqrstuvwxyzotherstuff" and "changedabcdefghijklmnopqrstuvwxyzmorestuff", if you just slice N-character blocks off the front, you're likely to hash different subsections of the common text. Some approach to resyncing boundaries is necessary. I couldn't think of a good approach using fixed-size blocks, but I've only spent 10 minutes on this. – Tony Delroy Feb 07 '13 at 09:59
  • I've found that separating by punctuation (or tags) seems to be better for what I'm trying to achieve, as the random text seems to have random length. I was thinking of taking letters from these more stable points in order, adding them to a string, and then using a function to compare and give a score, e.g. compare_MyHash("abcdefghij","ebcdeflhin") = 0.7, or 0.5 if I give more importance to starting characters. – Stefan Rogin Feb 07 '13 at 12:50

I thought I would answer this question as I am looking at a similar problem. The name for this idea, where hashes of similar objects have a high chance of colliding, is 'locality-sensitive hashing'. There is a lot of literature on the subject, but here is a simple example:

Imagine we have a fixed-length binary vector with entries in {1,0}. We can select a random subset of indices on which to calculate the hash, using built-in STL and Boost algorithms:

#include <unordered_map>
#include <unordered_set>
#include <algorithm>
#include <iostream>
#include <vector>
#include <boost/iterator/filter_iterator.hpp>
#include <boost/functional/hash.hpp>

// Hash only the elements of [first, last) that satisfy the predicate f.
template<class It, class Pred>
std::size_t hash_filtered_range(It first, It last, Pred f){
    return boost::hash_range(boost::make_filter_iterator(f, first, last),
                             boost::make_filter_iterator(f, last, last));
}

// Hashes an iterable object using only the elements at the chosen indices;
// positions outside the chosen set are ignored, so objects that differ
// only there still hash to the same value.
template<class Iter>
struct IterableHash{
    IterableHash(const Iter indices_begin, const Iter indices_end)
        : _inc_indices(indices_begin, indices_end){}

    template<class ObjType>
    std::size_t operator()(const ObjType& obj) const{
        int _ix = 0;
        // The lambda counts positions as the filter iterator advances and
        // keeps only those whose index is in _inc_indices.
        return hash_filtered_range(std::begin(obj), std::end(obj),
            [this, &_ix](const auto&){
                return this->_inc_indices.find(_ix++) != this->_inc_indices.end();
            });
    }

private:
    std::unordered_set<int> _inc_indices;   // indices included in the hash
};

// Equality predicate that treats two objects as equal when their filtered
// hashes collide, i.e. when they agree at the chosen indices.
template<class Hasher>
struct ApproxEqual{
    ApproxEqual(const Hasher& hash) : hash(hash){}

    template<class ObjType>
    bool operator()(const ObjType& o1, const ObjType& o2) const{
        return hash(o1) == hash(o2);
    }
private:
    Hasher hash;
};

Then, iterable objects have the same hash and compare equal as long as they agree at those indices:

i.e. on my computer:

// Hash only positions 0, 2 and 3; positions 1 and 4 are ignored.
std::vector<int> hash_vec{0, 2, 3};

using it = std::vector<int>::iterator;
IterableHash<it> hasher(hash_vec.begin(), hash_vec.end());

ApproxEqual<IterableHash<it>> cmp(hasher);
std::unordered_map<std::vector<char>, int,
                   IterableHash<it>, ApproxEqual<IterableHash<it>>> map(0, hasher, cmp);

std::vector<char> vec{1, 0, 1, 0, 1};
map[vec] = 33;

std::cout << hasher(vec) << "\n";

// Differs from vec only at index 4, which is not part of the hash,
// so it hashes and compares equal.
std::vector<char> fuzzy_vec{1, 0, 1, 0, 0};

std::cout << hasher(fuzzy_vec) << "\n";
std::cout << (map.find(fuzzy_vec)->second);

produces

11093822460655

11093822460655

33

i.e. we recover the value stored for vec when we query with fuzzy_vec.

user3684792

You can try using the DJB hash algorithm on your random words, and then compare the hash keys. Indeed, there is always a small chance that two different texts give the same result... but if a 32-bit hash is not enough, you can extend it to 64 bits and/or keep a reference to the text so you can compare the texts themselves when a hash is identical.

More details here: DJB Hash
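For reference, the classic djb2 variant is only a few lines; this sketch widens it to 64 bits as suggested above (the djb2_hash name is just for this sketch):

#include <cstdint>
#include <string>

// Classic djb2: start from 5381 and fold in each byte as hash * 33 + c.
std::uint64_t djb2_hash(const std::string& text){
    std::uint64_t hash = 5381;
    for (unsigned char c : text)
        hash = hash * 33 + c;   // often written as ((hash << 5) + hash) + c
    return hash;
}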

Stephane D.