To expand on my comment: use the Murmur3 non-cryptographic hash algorithm. You can get it from NuGet here: https://www.nuget.org/packages/murmurhash/
- Do not use the built-in GetHashCode() because, as you surmised, its values aren't safe to persist outside of your process.
- You can (but you shouldn't) use a cryptographically-secure hash function: they're computationally expensive and generally slow (not necessarily intentionally slow, but if SHA-256 were trivial to compute then I'd be a billionaire from finding SHA-256 hashes for Bitcoin mining).
- Hash functions like Murmur, by contrast, are designed to be fast and fairly collision-resistant.
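To make "fast" concrete, here is a minimal sketch of the 32-bit MurmurHash3 variant (x86_32), written from the public reference algorithm. It is for illustration only and is not the code of the NuGet package above; the class and method names (`Murmur3Sketch`, `Hash32`) are made up for this example. The point is that the whole thing is a handful of multiplies, rotates and XORs over the input bytes, and that the result depends only on those bytes and the seed, which is what makes it safe to persist.

```csharp
using System;

static class Murmur3Sketch
{
    // MurmurHash3 x86_32 over a byte array. Deterministic for a given input and seed.
    public static uint Hash32(byte[] data, uint seed = 0)
    {
        if (data is null) throw new ArgumentNullException(nameof(data));

        const uint c1 = 0xcc9e2d51;
        const uint c2 = 0x1b873593;
        uint h = seed;
        int blocks = data.Length / 4;

        unchecked
        {
            // Body: mix in each complete 4-byte block, read as little-endian.
            for (int i = 0; i < blocks; i++)
            {
                int o = i * 4;
                uint k = (uint)(data[o] | data[o + 1] << 8 | data[o + 2] << 16 | data[o + 3] << 24);
                k *= c1; k = RotL(k, 15); k *= c2;
                h ^= k; h = RotL(h, 13); h = h * 5 + 0xe6546b64;
            }

            // Tail: mix in the remaining 1 to 3 bytes, if any.
            int tail = blocks * 4;
            uint t = 0;
            switch (data.Length & 3)
            {
                case 3: t ^= (uint)data[tail + 2] << 16; goto case 2;
                case 2: t ^= (uint)data[tail + 1] << 8; goto case 1;
                case 1:
                    t ^= data[tail];
                    t *= c1; t = RotL(t, 15); t *= c2;
                    h ^= t;
                    break;
            }

            // Finalization: fold in the length and avalanche the bits.
            h ^= (uint)data.Length;
            h ^= h >> 16; h *= 0x85ebca6b;
            h ^= h >> 13; h *= 0xc2b2ae35;
            h ^= h >> 16;
        }

        return h;
    }

    private static uint RotL(uint x, int r) => (x << r) | (x >> (32 - r));
}
```

Calling `Murmur3Sketch.Hash32( Encoding.UTF8.GetBytes( "some message" ) )` gives the same value in every process and on every machine, unlike String.GetHashCode(). In real code I'd still use the package rather than a hand-rolled copy.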
So here's what I'd do:
- Write a function that serializes your LogEvent to a reusable MemoryStream for hashing by MurmurHash. The NuGet package I linked to can't automatically hash an arbitrary object, and even if it could, you need a rigidly defined hashing operation, so serializing in-memory is the "best" approach for now. Provided you re-use the MemoryStream, this won't be expensive.
- Store the hash in your database and/or cache it in-memory to reduce IO ops.
In your case:
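    using System;
    using System.IO;
    using System.Text;
    using Murmur; // Namespace of the "murmurhash" NuGet package (check against the version you install).

    interface ILogEventHasher
    {
        Int32 Compute32BitMurmurHash( LogEvent logEvent );
    }

    // Register this class as a singleton service in your DI container.
    // Note: the shared MemoryStream means this class is not thread-safe; add a lock if it's called concurrently.
    sealed class LogEventHasher : ILogEventHasher, IDisposable
    {
        private readonly MemoryStream ms = new MemoryStream();

        public Int32 Compute32BitMurmurHash( LogEvent logEvent )
        {
            if( logEvent is null ) throw new ArgumentNullException( nameof(logEvent) );

            this.ms.Position = 0;
            this.ms.SetLength( 0 ); // This resets the length pointer, it doesn't deallocate memory.

            // leaveOpen: true, otherwise disposing the BinaryWriter also disposes the shared MemoryStream.
            using( BinaryWriter wtr = new BinaryWriter( this.ms, Encoding.UTF8, leaveOpen: true ) )
            {
                wtr.Write( logEvent.DateTime.Ticks ); // BinaryWriter has no DateTime overload, so write the ticks.
                wtr.Write( (Int32)logEvent.Level );   // Assumes Level is an enum or integer type.
                wtr.Write( logEvent.Message );
            }

            this.ms.Position = 0; // This does NOT reset the Length pointer.

            Int32 result;
            using( Murmur32 mh = MurmurHash.Create32() )
            {
                Byte[] hash = mh.ComputeHash( this.ms );
                result = BitConverter.ToInt32( hash, 0 ); // `hash` will be 4 bytes long.
            }

            // Reset stream state (this has to run before returning, not after):
            this.ms.Position = 0;
            this.ms.SetLength( 0 );

            // Shrink the MemoryStream if it's grown too large:
            const Int32 TWO_MEGABYTES = 2 * 1024 * 1024;
            if( this.ms.Capacity > TWO_MEGABYTES )
            {
                this.ms.Capacity = TWO_MEGABYTES;
            }

            return result;
        }

        public void Dispose()
        {
            this.ms.Dispose();
        }
    }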
To filter LogEvent instances in-memory, just use a HashSet<( DateTime utc, Int32 hash )>.
I don't recommend using HashSet<Int32> (storing just the Murmur hash-codes) because a 32-bit non-cryptographically-secure hash-code doesn't give me enough confidence that a hash-code collision won't happen, but combining it with a DateTime value does give me sufficient confidence. A DateTime value consumes 64 bits, or 8 bytes, so each memoized LogEvent will require 12 bytes. Given .NET's 2GiB array/object size limit (and assuming a HashSet load-factor of 0.75), you can store up to 134,217,728 cached hash-codes in-memory. I hope that's enough!
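The arithmetic behind that figure, as a small sketch. The names here (`CacheCapacityEstimate`, `MaxEntries`) are only for illustration, and the 0.75 load factor is the assumption stated above, not something HashSet<T> documents.

```csharp
using System;

static class CacheCapacityEstimate
{
    // Reproduces the estimate: ( 2 GiB * 0.75 ) / 12 bytes per entry.
    public static long MaxEntries()
    {
        const long maxObjectBytes = 2L * 1024 * 1024 * 1024; // 2 GiB array/object size limit.
        const long bytesPerEntry = 8 + 4;                    // DateTime (8 bytes) + Int32 hash (4 bytes).

        // Apply the assumed 0.75 load factor as 3/4 to stay in integer arithmetic.
        return maxObjectBytes * 3 / 4 / bytesPerEntry;
    }
}
```

Treat the result as an upper bound: HashSet<T> also keeps its own per-entry bookkeeping next to each value, so the real footprint per entry is larger than 12 bytes.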
Here's an example:
    using System;
    using System.Collections.Generic;

    interface ILogEventFilterService
    {
        Boolean AlreadyLoggedEvent( LogEvent e );
    }

    // Register as a singleton service.
    class HashSetLogEventFilter : ILogEventFilterService
    {
        // Somewhat amusingly, internally this HashSet will use GetHashCode() - rather than our own hashes, because it's storing a kind of user-level "weak-reference" to a LogEvent in the form of a ValueTuple.
        private readonly HashSet<( DateTime utc, Int32 hash )> hashes = new HashSet<( DateTime utc, Int32 hash )>();

        private readonly ILogEventHasher hasher;

        public HashSetLogEventFilter( ILogEventHasher hasher )
        {
            this.hasher = hasher ?? throw new ArgumentNullException( nameof(hasher) );
        }

        public Boolean AlreadyLoggedEvent( LogEvent e )
        {
            if( e is null ) throw new ArgumentNullException( nameof(e) );
            if( e.DateTime.Kind != DateTimeKind.Utc )
            {
                throw new ArgumentException( message: "DateTime value must be in UTC.", paramName: nameof(e) );
            }

            Int32 murmurHash = this.hasher.Compute32BitMurmurHash( e );
            var t = ( utc: e.DateTime, hash: murmurHash );

            // HashSet.Add returns false when the item is already present, i.e. the event was already logged.
            return this.hashes.Add( t ) == false;
        }
    }
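The filter leans on the return value of HashSet<T>.Add, which is true the first time a value is added and false for a duplicate. A tiny stand-alone sketch of just that behaviour, with a made-up timestamp and hash value:

```csharp
using System;
using System.Collections.Generic;

static class DuplicateKeyDemo
{
    // Adds the same ( utc, hash ) key twice and reports what Add returned each time.
    public static (bool first, bool second) AddTwice()
    {
        var seen = new HashSet<(DateTime utc, int hash)>();

        // Hypothetical values; in the filter these come from the LogEvent and the Murmur hasher.
        var key = (utc: new DateTime(2021, 1, 1, 0, 0, 0, DateTimeKind.Utc), hash: 12345);

        bool first = seen.Add(key);   // The key is new, so it gets stored.
        bool second = seen.Add(key);  // The key is already present, so nothing changes.
        return (first, second);
    }
}
```

Because ValueTuple compares its elements by value, two events with the same timestamp and the same hash count as the same key even when they are different LogEvent objects.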
If you want to do it in the database directly, then define a custom user-defined table type for a table-valued parameter for a stored procedure that runs a MERGE statement of this form:
    CREATE TABLE dbo.LogEvents (
        Utc        datetime2(7)   NOT NULL,
        MurmurHash int            NOT NULL,
        LogLevel   int            NOT NULL,
        Message    nvarchar(4000) NOT NULL
    );

    MERGE INTO dbo.LogEvents AS tgt WITH ( HOLDLOCK ) -- Always MERGE with HOLDLOCK!!!!!
    USING @tvp AS src ON src.Utc = tgt.Utc AND src.MurmurHash = tgt.MurmurHash
    WHEN NOT MATCHED BY TARGET THEN
        INSERT( Utc, MurmurHash, LogLevel, Message )
        VALUES( src.Utc, src.MurmurHash, src.LogLevel, src.Message )
    ;
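On the C# side, one common way to feed a table-valued parameter is to build a DataTable whose columns match the table type. This is a rough sketch: the class name `LogEventTvp` is hypothetical, and it assumes your table type declares Utc, MurmurHash, LogLevel and Message in that order, like the table above.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

static class LogEventTvp
{
    // Builds a DataTable shaped like the (assumed) user-defined table type behind @tvp.
    public static DataTable Build(IEnumerable<(DateTime utc, int murmurHash, int logLevel, string message)> rows)
    {
        if (rows is null) throw new ArgumentNullException(nameof(rows));

        var table = new DataTable();

        // Keep the column order identical to the table type's definition.
        table.Columns.Add("Utc", typeof(DateTime));
        table.Columns.Add("MurmurHash", typeof(int));
        table.Columns.Add("LogLevel", typeof(int));
        table.Columns.Add("Message", typeof(string));

        foreach (var r in rows)
        {
            table.Rows.Add(r.utc, r.murmurHash, r.logLevel, r.message);
        }

        return table;
    }
}
```

You would then pass the table to the stored procedure as a SqlParameter with SqlDbType.Structured and its TypeName set to whatever you named the table type.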