
I store application events into a database; they are extracted from different text files.

The event object is as follows:

public class LogEvent
{
    public DateTime DateTime { get; set; }
    public LogLevel Level { get; set; }
    public string Message { get; set; } //can be lengthy
}

Please note that I do not own this structure: I cannot add any property (like a unique Guid) to the originally generated object, but I can extend the class and create additional database columns from the information I have.

My problem is that I would like to ensure that I do not insert the same event twice, even though the same event can appear in several different files. The DateTime+Level properties are not sufficient to establish equality: different events can occur at the exact same time.

Therefore, whenever I insert an event or a list of events into the database, I need to compare against the Message property, which is highly inefficient given the potential string length: it implies that I have to transmit, one way or another, the Message property of already-inserted events, and compare against it either locally or via a database index.

I thought about creating an additional Hashcode property which would store the result of String.GetHashCode() on the Message property. But I read here that this is not good practice, since the GetHashCode implementation is not stable across program executions (there are also potential collisions, but that risk is acceptable).

So I am stuck with the following problem: how do I build a comparison value from long strings which is deterministic, fast to compute and compare against, and has an acceptable rate of collision? Strings can be up to a few thousand characters.
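
For illustration, this is the shape of the extension I have in mind (the StoredLogEvent/MessageHash names and the SHA-256 placeholder are mine; choosing the actual hash function is exactly my question):

using System;
using System.Security.Cryptography;
using System.Text;

public class StoredLogEvent : LogEvent
{
    // Extra database column: a deterministic digest of Message.
    // SHA-256 is only a stand-in here; unlike String.GetHashCode(),
    // it is stable across program executions.
    public byte[] MessageHash { get; set; }

    public static StoredLogEvent From( LogEvent e )
    {
        using( SHA256 sha = SHA256.Create() )
        {
            return new StoredLogEvent
            {
                DateTime    = e.DateTime,
                Level       = e.Level,
                Message     = e.Message,
                MessageHash = sha.ComputeHash( Encoding.UTF8.GetBytes( e.Message ) )
            };
        }
    }
}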

I am aware of Jon Skeet's answer to a similar question here, but it's quite old (almost 10 years) and I was wondering if there is a better method in 2020!

Thanks for your hints!

XavierAM
  • Your thought about a hash of the message is a good one, just don't use `String.GetHashCode()`, instead use a cryptographically secure (and stable) hashing algorithm and you should be set. This will still have the chance of a false positive though, low as it may be, so you should still do the full string comparison if the hashes compare equal. **However**, you said you did not own that structure, doesn't that imply you can't add the hash to the database either? If you have to read the string from the database, hash it, then hash the string you want to compare to and then compare the hashes..... – Lasse V. Karlsen Aug 31 '20 at 11:39
  • hmm... not clear: "I cannot add any property..." then "I thought about creating an additional property...". Is the structure fixed, or can it be extended? – Mario Vernari Aug 31 '20 at 11:39
  • Persisting `GetHashCode()` would be a bad idea, correct - however if you use a known and well-defined non-cryptographic hashing algorithm, such as `MurmurHash3`, then that works. I use that in my projects that need to do cheap one-way pushes of data to a relational database and that's worked great for me (I use it inside a `MERGE INTO` statement with megabytes of possibly-duplicate rows in a table-valued-parameter). – Dai Aug 31 '20 at 11:39
  • @MarioVernari I think the OP means that they can't add members to `LogEvent` but they can add columns to their database table. – Dai Aug 31 '20 at 11:40
  • @Dai That makes sense, and then a hashing/checksum algorithm would work out fine; even something as basic as one of the CRC16/CRC32 implementations would drastically reduce the number of full string comparisons that have to be done. – Lasse V. Karlsen Aug 31 '20 at 11:41
  • @MarioVernari and @Dai correct, I can extend the LogEvent class but I cannot change how those events are generated by the application – XavierAM Aug 31 '20 at 11:42
  • Even Jon's very naive hash is likely good enough for your purposes. A hash doesn't need to be particularly sophisticated to be effective, as long as it doesn't get too bad with collisions. Note that when using SQL Server as a database in particular, use of `CHECKSUM` and `BINARY_CHECKSUM` as hash functions should be avoided at all costs, as these functions are *too* naive and have terrible collision behavior. – Jeroen Mostert Aug 31 '20 at 11:42
  • I endorse the @LasseV.Karlsen idea. – Mario Vernari Aug 31 '20 at 11:43
  • One remark: although the quality of the hash is largely unimportant if you're using it on data that can be trusted, if you're using it in a context where an attacker could potentially craft data and feed it to your application (especially in an online context), you may want to [reconsider](https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/) and use a crypto hash anyway. Even if slower, it will still be far faster than comparing entire values all the time, and hashing the values may well not be the bottleneck in your application to begin with. – Jeroen Mostert Aug 31 '20 at 12:35
  • The short answer is "no"; there isn't a better answer than Skeet's. As @JeroenMostert said, the quality of the hash function isn't important in this situation; it's the stability and performance that matter. Collisions can be handled easily by comparing the actual strings when needed. And I don't see why you'd bother keeping a hashset of the hashes in memory, as that won't scale. Just calculate the hashes and pass them to the database while storing the strings. Use a proc or SQL script to prevent duplicates using the hash. – MikeJ Aug 31 '20 at 13:20

2 Answers


To expand on my comment: Use the Murmur3 non-cryptographic hash algorithm. You can get it from NuGet here: https://www.nuget.org/packages/murmurhash/

  • Do not use the built-in GetHashCode() because, as you surmised, it isn't safe to persist outside of your process.
  • You can use cryptographically-secure hash-functions, but you shouldn't, because they're computationally expensive to calculate - and generally slow (not necessarily intentionally slow, but if SHA-256 were trivial to compute then I'd be a billionaire from finding SHA-256 hashes for Bitcoin mining).
    • Whereas hashing-functions like Murmur are designed to be fast and fairly collision-resistant.

So here's what I'd do:

  1. Write a function that serializes your LogEvent to a reusable MemoryStream for hashing by MurmurHash (the NuGet package I linked to does not have the ability to automatically hash any object - and even if it did, you need a rigidly-defined hashing operation - as it is, serializing in-memory is the "best" approach for now). Provided you re-use the MemoryStream this won't be expensive.
  2. Store the hash in your database and/or cache it in-memory to reduce IO ops.

In your case:

using System;
using System.IO;
using System.Text;
using Murmur; // from the murmurhash NuGet package linked above

interface ILogEventHasher
{
    Int32 Compute32BitMurmurHash( LogEvent logEvent );
}

// Register this class as a singleton service in your DI container.
sealed class LogEventHasher : ILogEventHasher, IDisposable
{
    private readonly MemoryStream ms = new MemoryStream();

    public Int32 Compute32BitMurmurHash( LogEvent logEvent )
    {
        if( logEvent is null ) throw new ArgumentNullException( nameof(logEvent) );

        this.ms.Position = 0;
        this.ms.SetLength( 0 ); // This resets the length pointer, it doesn't deallocate memory.

        // leaveOpen: true, so disposing the BinaryWriter won't close the reusable MemoryStream.
        using( BinaryWriter wtr = new BinaryWriter( this.ms, Encoding.UTF8, leaveOpen: true ) )
        {
            wtr.Write( logEvent.DateTime.Ticks ); // BinaryWriter has no DateTime overload, so write the Int64 ticks.
            wtr.Write( (Int32)logEvent.Level   ); // Enums are written as their underlying integer value.
            wtr.Write( logEvent.Message        );
        }

        this.ms.Position = 0; // This does NOT reset the Length pointer.

        Int32 result;
        using( Murmur32 mh = MurmurHash.Create32() )
        {
            Byte[] hash = mh.ComputeHash( this.ms ); // `hash` will be 4 bytes long.
            result = BitConverter.ToInt32( hash, startIndex: 0 );
        }

        // Reset stream state:
        this.ms.Position = 0;
        this.ms.SetLength( 0 );

        // Shrink the MemoryStream if it's grown too large:
        const Int32 TWO_MEGABYTES = 2 * 1024 * 1024;
        if( this.ms.Capacity > TWO_MEGABYTES )
        {
            this.ms.Capacity = TWO_MEGABYTES;
        }

        return result;
    }

    public void Dispose()
    {
        this.ms.Dispose();
    }
}
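
For example, usage might look like this (a quick sketch; the LogLevel.Error member is assumed to exist in your LogLevel enum):

using( LogEventHasher hasher = new LogEventHasher() )
{
    LogEvent e = new LogEvent
    {
        DateTime = DateTime.UtcNow,
        Level    = LogLevel.Error, // assumed member of your enum
        Message  = "Unhandled exception in worker thread."
    };

    Int32 murmurHash = hasher.Compute32BitMurmurHash( e );
    // Unlike String.GetHashCode(), this value is stable across runs and machines.
}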

To filter LogEvent instances in-memory, just use a HashSet<( DateTime utc, Int32 hash )>.

I don't recommend using HashSet<Int32> (storing just the Murmur hash-codes) because a 32-bit non-cryptographically-secure hash-code doesn't give me enough confidence that a hash-code collision won't happen - but combining it with a DateTime value does give me sufficient confidence. A DateTime value consumes 64 bits (8 bytes), so each memoized LogEvent requires 12 bytes; given .NET's 2GiB array/object size limit (and assuming a HashSet load-factor of 0.75), that means you can store up to 134,217,728 cached hash-codes in-memory. I hope that's enough!

Here's an example:

interface ILogEventFilterService
{
    Boolean AlreadyLoggedEvent( LogEvent e );
}

// Register as a singleton service.
class HashSetLogEventFilter : ILogEventFilterService
{
    // Somewhat amusingly, internally this HashSet will use GetHashCode() - rather than our own hashes, because it's storing a kind of user-level "weak-reference" to a LogEvent in the form of a ValueTuple.

    private readonly HashSet<( DateTime utc, Int32 hash )> hashes = new HashSet<( DateTime utc, Int32 hash )>();

    private readonly ILogEventHasher hasher;

    public HashSetLogEventFilter( ILogEventHasher hasher )
    {
        this.hasher = hasher ?? throw new ArgumentNullException( nameof(hasher) );
    }

    public Boolean AlreadyLoggedEvent( LogEvent e )
    {
        if( e is null ) throw new ArgumentNullException( nameof(e) );

        if( e.DateTime.Kind != DateTimeKind.Utc )
        {
            throw new ArgumentException( message: "DateTime value must be in UTC.", paramName: nameof(e) );
        }

        Int32 murmurHash = this.hasher.Compute32BitMurmurHash( e );
        
        var t = ( utc: e.DateTime, hash: murmurHash );
        
        return this.hashes.Add( t ) == false;
    }
}
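
Wiring it up by hand (if you're not using a DI container) is a short sketch; eventsParsedFromFiles and InsertIntoDatabase below are placeholders for your own code:

using( LogEventHasher hasher = new LogEventHasher() )
{
    ILogEventFilterService filter = new HashSetLogEventFilter( hasher );

    foreach( LogEvent e in eventsParsedFromFiles ) // placeholder: your file parser's output
    {
        if( filter.AlreadyLoggedEvent( e ) )
            continue; // duplicate within this run: skip it

        InsertIntoDatabase( e ); // placeholder: your existing insert routine
    }
}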

If you want to do it in the database directly, then define a custom user-defined-table-type for a table-valued-parameter for a stored-procedure that runs a MERGE statement of this form:

CREATE TABLE dbo.LogEvents (
    Utc        datetime2(7)   NOT NULL,
    MurmurHash int            NOT NULL,
    LogLevel   int            NOT NULL,
    Message    nvarchar(4000) NOT NULL
);
MERGE INTO dbo.LogEvents AS tgt WITH ( HOLDLOCK ) -- Always MERGE with HOLDLOCK!!!!!
USING @tvp AS src ON src.Utc = tgt.Utc AND src.MurmurHash = tgt.MurmurHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT(     Utc,     MurmurHash,     LogLevel,     Message )
    VALUES( src.Utc, src.MurmurHash, src.LogLevel, src.Message )
;
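
On the client side, passing the rows as a table-valued parameter might look like this sketch (the dbo.LogEventType type name and dbo.InsertLogEvents procedure name are placeholders you'd create yourself):

using System.Data;
using System.Data.SqlClient;

static void InsertLogEvents( String connectionString, DataTable rows )
{
    // `rows` columns must match the user-defined table type:
    // Utc datetime2, MurmurHash int, LogLevel int, Message nvarchar(4000).
    using( SqlConnection conn = new SqlConnection( connectionString ) )
    using( SqlCommand cmd = new SqlCommand( "dbo.InsertLogEvents", conn ) )
    {
        cmd.CommandType = CommandType.StoredProcedure;

        SqlParameter p = cmd.Parameters.AddWithValue( "@tvp", rows );
        p.SqlDbType = SqlDbType.Structured;
        p.TypeName  = "dbo.LogEventType"; // the user-defined table type

        conn.Open();
        cmd.ExecuteNonQuery();
    }
}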
Dai
  • Just a minor comment: it is not (necessarily) true that a crypto hash is slow, whether by accident or on purpose; no system particularly requires that hashes be slow to calculate for security, at best that they take constant time to avoid certain timing attacks. On the contrary, optimizing hash speed is often desired, which is why [dedicated instructions](https://en.wikipedia.org/wiki/Intel_SHA_extensions) are a thing. Slowness does come into play when deriving a key from a password by hashing, but then you're talking many, many iterations of a hash to force this. – Jeroen Mostert Aug 31 '20 at 11:58
  • @JeroenMostert Correct! - and thank you for the feedback! I'll clarify my claim about cryptographic hashes being slow now. – Dai Aug 31 '20 at 12:00
  • Well, now you've sort of made it worse by claiming that slowness *is* necessary for security. :-) You'd be a billionaire if you could efficiently find *collisions* of SHA-256 values, and that is of course something crypto hashes are designed to avoid at all costs -- but the speed with which you can calculate it isn't relevant for that, because that would imply brute forcing is a viable attack in the first place. The real reason to avoid a full crypto hash would probably be that smaller hash values are faster to store and process, but (truncated) crypto hashes do have library support as a plus. – Jeroen Mostert Aug 31 '20 at 12:07
  • Specifically, to elaborate on the previous point a bit: if you stored a (truncated) MD5 or SHA-1 hash, SQL Server could calculate the hash server-side with the built-in `HASHBYTES` function, whereas a custom hash like Murmur3 can only be done client-side (or require a more complicated setup where you make the hash available as a CLR function). Whether or not easy server-side hashing is a plus depends on the scenario, of course. – Jeroen Mostert Aug 31 '20 at 12:09
  • @JeroenMostert I tend to avoid hashing within SQL Server because there's way too many pitfalls and gotchas when it comes to handling text in general (e.g. leading/trailing whitespace, varchar vs char, UTF-8 with or without a leading BOM, etc) and SQL Server does not seem to offer the degree of control required to generate hashes reliably in a cross-platform manner - sucks :/ – Dai Aug 31 '20 at 12:14
  • @JeroenMostert Of course, there is SQL-CLR... but that's kinda DOA now, as Azure SQL doesn't support it and I don't think there's any plans for it to support .NET Core either. – Dai Aug 31 '20 at 12:15
  • 1
    @JeroenMostert Forgive me, it's 5am here and crashing-out - can you please edit my answer to provide a good explanation why using Murmur is probably a better algo than using a crypto-hashing algorithm for the OP's problem? I know what you're saying but I'm too tired to reword it into my answer :( – Dai Aug 31 '20 at 12:16
  • It's true that you do need very specific rules on how things are done (meaning, hash either pure binary values and/or consistently use `NVARCHAR` so you only have to deal with UTF-16), so it's not a general solution -- I just mentioned it for completeness. And while Azure SQL is neat, on-premises SQL Server isn't going anywhere soon and neither is CLR integration for that. :-) – Jeroen Mostert Aug 31 '20 at 12:17
  • I would edit your answer but I'm not sure it *is* better; I haven't run my own benchmarks. The built-in crypto hashes are often well optimized by people who know their stuff, I don't know if (this particular use of) Murmur3 will beat it and if it does, whether it beats it significantly and whether this matters in the OP's scenario. They should do their own measuring and/or decide if it's important for them to have a hash function that's "easy" to calculate by standard libraries (notwithstanding that Murmur3 has very many implementations across languages itself). – Jeroen Mostert Aug 31 '20 at 12:22

Step 1. Compare them by length; that will cut out most of them. Step 2. Compare the strings of the same length by the first character, and so on.
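
In code, the idea looks roughly like this sketch (essentially what String.Equals with StringComparison.Ordinal already does internally):

// Rejects most non-equal strings on the cheap length check,
// and otherwise stops at the first mismatching character.
static Boolean MessagesEqual( String a, String b )
{
    if( a.Length != b.Length ) return false; // step 1

    for( Int32 i = 0; i < a.Length; i++ )    // step 2
    {
        if( a[i] != b[i] ) return false;
    }

    return true;
}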

Roman Ryzhiy