Your answer contains code that only runs in Python 3 and code that only runs in Python 2. It is not clear which version you are developing for. This answer is written for Python 3.
IMHO, message deduplication should be a property of the logger and not be restricted to one of its methods. Hence, I'd create a dedicated logger for this purpose:
Note: I am using a list as the cache container because your examples neither show a requirement for caching duplicates under different keys nor a need to access the cached messages for anything other than the dupe check.
import logging

# define the cache at the module level if different logger instances from different classes
# shall share the same cache
# _message_cache = []


class NonRepetitiveLogger(logging.Logger):

    # define the cache as a class attribute if all logger instances of _this_ class
    # shall share the same cache
    # _message_cache = []

    def __init__(self, name, level=logging.NOTSET):
        super().__init__(name=name, level=level)
        # define the cache as an instance variable if you want each logger instance
        # to use its own cache
        self._message_cache = []

    def _log(self, level, msg, args, exc_info=None, extra=None, stack_info=False):
        msg_hash = hash(msg)  # using the hash() builtin; see remark below
        if msg_hash in self._message_cache:
            return
        self._message_cache.append(msg_hash)
        super()._log(level, msg, args, exc_info, extra, stack_info)
This allows for the use of the native Logger interface:
logger = NonRepetitiveLogger("test")
sh = logging.StreamHandler()
sh.setFormatter(logging.Formatter('[%(levelname)s] - %(message)s'))
logger.addHandler(sh)
logger.setLevel(logging.DEBUG)
print(logger)
logger.debug("foo")
logger.error("foo")
logger.info("bar")
logger.info("foo")
logger.warning("foo")
# Output
<NonRepetitiveLogger test (DEBUG)>
[DEBUG] - foo
[INFO] - bar
If you want your root logger to be an instance of your custom logger class, as the code in your MyOwnLogger initializer suggests, monkey-patch the logging module:
logging.root = logging.Logger.root = logging.Logger.manager.root = NonRepetitiveLogger(name="root")
print(logger.root)
print(logger.root is logging.getLogger())
# Output
<NonRepetitiveLogger root (NOTSET)>
True
You can also set your logger class as the default, so that logging.getLogger returns instances thereof:
logging.setLoggerClass(NonRepetitiveLogger)
print(logging.getLogger("test2"))
# Output
<NonRepetitiveLogger test2 (NOTSET)>
Should you really need the key-based duplicate lookup, you can still implement it using the approach outlined above. The following example defines the cache container as a class attribute, so that the __init__ reimplementation can be omitted.
class NonRepetitiveLogger(logging.Logger):

    # cache is a class attribute, so there is no need to override __init__ to define it per instance
    _message_cache = {}

    # note the additional **kwargs parameter
    def _log(self, level, msg, args, exc_info=None, extra=None, stack_info=False, **kwargs):
        cache_key = kwargs.get("cache_key")
        msg_hash = hash(msg)
        if self._message_cache.get(cache_key) == msg_hash:
            return
        if cache_key is not None:
            self._message_cache[cache_key] = msg_hash
        super()._log(level, msg, args, exc_info, extra, stack_info)
This still largely maintains the native interface. However, the cache key must now be passed as a keyword argument:
# logger instantiation just like before
logger.debug("foo", cache_key="1")
logger.error("foo", cache_key="1")
logger.info("bar", cache_key="1")
logger.info("foo", cache_key="1")
logger.warning("foo", cache_key="2")
logger.error("foo")
# Output
[DEBUG] - foo # new item in cache; key = 1
[INFO] - bar # value for key 1 overwritten
[INFO] - foo # value for key 1 overwritten
[WARNING] - foo # new item in cache; key = 2
[ERROR] - foo # no cache_key provided; always logs
If you'd rather ignore messages that don't come with a cache_key, simply check for that when deciding whether to return early:
if cache_key is None or self._message_cache.get(cache_key) == msg_hash:
    return
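For illustration, the complete _log might then read as follows (unchanged apart from the early return; the cache update no longer needs the is-not-None guard because keyless messages never get past the return):

    def _log(self, level, msg, args, exc_info=None, extra=None, stack_info=False, **kwargs):
        cache_key = kwargs.get("cache_key")
        msg_hash = hash(msg)
        # ignore keyless messages as well as known duplicates for the given key
        if cache_key is None or self._message_cache.get(cache_key) == msg_hash:
            return
        self._message_cache[cache_key] = msg_hash
        super()._log(level, msg, args, exc_info, extra, stack_info)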
Regarding your use of md5:
I consider the benefit of hashing the messages debatable. Without knowing your potential number of unique log messages and their lengths, I somewhat doubt that the memory footprint of your cache would be significant if you just stored the message strings directly.
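A minimal sketch of that variant, assuming the instance-level cache from the first example and simply storing the raw strings in a set:

class NonRepetitiveLogger(logging.Logger):

    def __init__(self, name, level=logging.NOTSET):
        super().__init__(name=name, level=level)
        self._message_cache = set()  # holds the raw message strings

    def _log(self, level, msg, args, exc_info=None, extra=None, stack_info=False):
        if msg in self._message_cache:
            return
        self._message_cache.add(msg)
        super()._log(level, msg, args, exc_info, extra, stack_info)

As a side effect, the set gives you constant-time membership checks, whereas the lookup in the list variant is linear.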
If you wish to use checksums nevertheless, I'd recommend the builtin hash() instead.
md5 is a cryptographic hash function and hence rather costly compared to hash. The time required to hash a (byte-)string scales with its size.
Using timeit, we can time short portions of Python code. The following defines four string objects of different sizes and prints the average execution time of 10,000 iterations of both md5 and hash for each string:
from hashlib import md5
from timeit import timeit

string1 = "a"
string100 = string1 * 100
string10000 = string1 * 10000
string100000 = string1 * 100000

number = 10000
for l in (1, 100, 10000, 100000):
    for alg in [('md5', '.encode()'), ('hash', '')]:
        a, f = alg
        res = timeit("{}(string{}{})".format(a, l, f), globals=globals(), number=number)
        print("{:<6}{:>6} chars => {:>7.0f} ns/op".format(a, l, (res / number) * 1000000000))
# Output
md5 1 chars => 507 ns/op
hash 1 chars => 85 ns/op
md5 100 chars => 649 ns/op
hash 100 chars => 89 ns/op
md5 10000 chars => 17252 ns/op
hash 10000 chars => 86 ns/op
md5 100000 chars => 168031 ns/op
hash 100000 chars => 97 ns/op
As you can see, while hash's wall time remains at around 90 nanoseconds no matter the size of the string, the time it takes md5 to return scales up to 168 microseconds.
Using md5 will certainly not affect your program's performance in a tangible way, but there's no need for a cryptographic hash when a simple checksum suffices.
Also, hash results are "shorter" than those of md5: on a 64-bit system (i.e. returning a 64-bit signed integer), the result represented as a string is either 19 or 20 characters long, depending on whether it is positive or negative.
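For comparison, assuming you store both results as strings (an md5 hex digest is always 32 characters long):

from hashlib import md5

message = "some log message"  # arbitrary example string
print(len(str(hash(message))))                 # usually 19 or 20 characters
print(len(md5(message.encode()).hexdigest()))  # always 32 characters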