My script is waiting on a task to complete. While waiting, it polls every 30 seconds and logs a message "Task X is in progress".

Since the script waits for 10 minutes, I get 20 messages with the same content.

On top of this, the script polls several tasks of the same kind, so the log looks like this:

"Task X is in progress"
"Task Y is in progress"
"Task Z is in progress"
"Task X is in progress"
"Task Y is in progress"
"Task Z is in progress"
"Task X is in progress"
"Task Y is in progress"
"Task Z is in progress"
"Task X is complete"
"Task Y is complete"
"Task Z is complete"

How would I construct a new logging method that can handle such repeating messages and print each message only once?

My solution is in the answers, but if you are familiar with other methods, I'd be happy to learn.

RaamEE

2 Answers

Your answer contains code that only runs in Python 3 and code that only runs in Python 2. It is not clear which version you are developing for. This answer is written for Python 3.

IMHO, message deduplication should be a property of the logger and not be restricted to one of its methods. Hence, I'd create a dedicated logger for this purpose:

Note: I am using a list as the cache container because your examples show neither a requirement for caching duplicates under different keys, nor a need to access the cached messages for anything other than the duplicate check.

import logging

# define the cache at the module level if different logger instances from different classes
# shall share the same cache
# _message_cache = []

class NonRepetitiveLogger(logging.Logger):
    # define the cache as class attribute if all logger instances of _this_ class
    # shall share the same cache
    # _message_cache = []

    def __init__(self, name, level=logging.NOTSET):
        super().__init__(name=name, level=level)
        # define the cache as instance variable if you want each logger instance
        # to use its own cache
        self._message_cache = []

    # note: this matches Logger._log()'s signature up to Python 3.7;
    # Python 3.8 added a stacklevel parameter
    def _log(self, level, msg, args, exc_info=None, extra=None, stack_info=False):
        msg_hash = hash(msg)  # using the hash() builtin; see remark below
        if msg_hash in self._message_cache:
            return
        self._message_cache.append(msg_hash)
        super()._log(level, msg, args, exc_info, extra, stack_info)

This allows for the use of the native Logger interface:

logger = NonRepetitiveLogger("test")
sh = logging.StreamHandler()
sh.setFormatter(logging.Formatter('[%(levelname)s] - %(message)s'))
logger.addHandler(sh)
logger.setLevel(logging.DEBUG)
print(logger)

logger.debug("foo")
logger.error("foo")
logger.info("bar")
logger.info("foo")
logger.warning("foo")

# Output
<NonRepetitiveLogger test (DEBUG)>
[DEBUG] - foo
[INFO] - bar

If you want your root logger to be an instance of your custom logger class, as the code in your MyOwnLogger initializer suggests, monkey-patch the logging module:

logging.root = logging.Logger.root = logging.Logger.manager.root = NonRepetitiveLogger(name="root")
print(logger.root)
print(logger.root is logging.getLogger())

# Output
<NonRepetitiveLogger root (NOTSET)>
True

You can also set your logger class as default, so that logging.getLogger returns instances thereof:

logging.setLoggerClass(NonRepetitiveLogger)
print(logging.getLogger("test2"))

# Output
<NonRepetitiveLogger test2 (NOTSET)>

Should you really need the key-based duplicate lookup, you can still implement it using the approach outlined above.
The following example defines the cache container as a class attribute, so the __init__ reimplementation can be omitted.

class NonRepetitiveLogger(logging.Logger):
    _message_cache = {}
    # cache is class attribute, no need to override __init__ to define it for the instance

    # note the additional **kwargs parameter
    def _log(self, level, msg, args, exc_info=None, extra=None, stack_info=False, **kwargs):
        cache_key = kwargs.get("cache_key")
        msg_hash = hash(msg)
        if self._message_cache.get(cache_key) == msg_hash:
            return
        if cache_key is not None:
            self._message_cache[cache_key] = msg_hash
        super()._log(level, msg, args, exc_info, extra, stack_info)

This still largely maintains the native interface; however, the cache key must now be passed as a keyword argument:

# logger instantiation just like before
logger.debug("foo", cache_key="1")
logger.error("foo", cache_key="1")
logger.info("bar", cache_key="1")
logger.info("foo", cache_key="1")
logger.warning("foo", cache_key="2")
logger.error("foo")

# Output
[DEBUG] - foo   # new item in cache; key = 1
[INFO] - bar    # value for key 1 overwritten
[INFO] - foo    # value for key 1 overwritten
[WARNING] - foo # new item in cache; key = 2
[ERROR] - foo   # no cache_key provided; always logs

If you'd rather ignore messages that don't come with a cache_key, simply check for that when deciding whether to return early:

        if cache_key is None or self._message_cache.get(cache_key) == msg_hash:
            return
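
With that change, the final logger.error("foo") call from the example above would be suppressed instead of always being logged.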

Regarding your use of md5:
I consider the benefit of hashing the messages debatable. Not knowing your potential number of unique log messages and their length, I somewhat doubt that the memory footprint of your cache would be significant if you just stored the message strings directly.
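
For illustration, a minimal sketch of that simpler idea (my assumption, not a stated requirement; the class name is just for illustration): store the raw message strings in a set, which also makes the membership check O(1) instead of the O(n) scan of a list.

import logging

class SimpleNonRepetitiveLogger(logging.Logger):
    def __init__(self, name, level=logging.NOTSET):
        super().__init__(name=name, level=level)
        self._message_cache = set()  # raw messages instead of hashes

    def _log(self, level, msg, args, exc_info=None, extra=None, stack_info=False):
        if msg in self._message_cache:
            return
        self._message_cache.add(msg)
        super()._log(level, msg, args, exc_info, extra, stack_info)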

If you wish to use checksums nevertheless, I'd recommend the builtin hash() instead.

md5 is a cryptographic hash function and hence rather costly compared to hash(). The time required to hash a (byte) string scales with its size.

Using timeit, we can time short portions of Python code. The following defines four strings of different sizes and, for each string, prints the average execution time over 10,000 iterations of both md5 and hash:

string1 = "a"
string100 = string1 * 100
string10000 = string1 * 10000
string100000 = string1 * 100000

number = 10000

for l in (1, 100, 10000, 100000):
    for a, f in (('md5', '.encode()'), ('hash', '')):
        res = timeit("{}(string{}{})".format(a, l, f), globals=globals(), number=number)
        print("{:<6}{:>6} chars => {:>7.0f} ns/op".format(a, l, (res/number)*1000000000))

# Output
md5        1 chars =>     507 ns/op
hash       1 chars =>      85 ns/op
md5      100 chars =>     649 ns/op
hash     100 chars =>      89 ns/op
md5    10000 chars =>   17252 ns/op
hash   10000 chars =>      86 ns/op
md5   100000 chars =>  168031 ns/op
hash  100000 chars =>      97 ns/op

As you can see, while hash's wall time remains at around 90 nanoseconds regardless of the string's size, the time it takes md5 to return scales up to 168 microseconds. (A caveat on the benchmark: CPython caches a str object's hash after the first computation, so the repeated calls here mostly return the cached value; the first hash() of a new string does scale with its length, though it is still much cheaper than md5.)

Using md5 will certainly not affect your program's performance in a tangible way, but there's no need to use a cryptographic hash when a simple checksum suffices. (Note that str hashes are randomized per interpreter process; that is irrelevant for an in-memory cache, but it means hash() values, unlike md5 digests, are not stable across runs.)
Also, hash results are "shorter" than those of md5: on a 64-bit system (i.e. returning a 64-bit signed integer), the result represented as a string is either 19 or 20 characters long, depending on whether it is positive or negative, whereas an md5 hexdigest is always 32 characters.
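
To illustrate both points (the hash value shown is just an example; it changes per interpreter run due to the randomization mentioned above):

import hashlib

h = hash("Task X is in progress")
print(h, len(str(h)))
print(len(hashlib.md5(b"Task X is in progress").hexdigest()))

# Example output
-7631480581551380648 20
32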

shmee
  • Thanks @shmee for your implementation. To reply to the points you raised: 1) My answer's code runs under Python 3.7.3. 2) I wasn't aware of hash(), so thanks for that. 3) My use case requires the caching: messages are customized per instance, and I know specific ones will repeat (e.g. logs during polling), so I limit the cache lookup to the custom level "UNIQUE". It also helps to identify non-repeating messages in the log, as they are tagged with [UNIQUE]. Your code inspired me to also print the "message name" as part of the log line for easier log reading. – RaamEE Aug 20 '19 at 08:06

The solution below uses a dict to keep an md5sum of each text message, allowing for an easy check of message similarity and saving memory in the case of long messages.

Each message is logged with a unique name identifier. The next time that identifier is used, the new message's md5sum is compared against the md5sum of the previous log. If the message has a different md5sum, it is logged and its md5sum replaces the previous one in the dict.

e.g.

logger.unique('task_x', f'Task X is {status}')
logger.unique('task_y', f'Task Y is {status}')

where status can be 'in progress' or 'complete', will suppress consecutive identical messages.

That is, it takes a log like this:

12:00:00 - "Task X is in progress"
12:00:01 - "Task Y is in progress"
12:00:10 - "Task X is in progress"
12:00:11 - "Task Y is in progress"
12:00:20 - "Task X is in progress"
12:00:30 - "Task X is complete"
12:00:31 - "Task Y is complete"

and cleans it up to look like this:

12:00:00 - "Task X is in progress"
12:00:30 - "Task X is complete"
12:00:01 - "Task Y is in progress"
12:00:31 - "Task Y is complete"

Code:

# https://fangpenlin.com/posts/2012/08/26/good-logging-practice-in-python/
import logging
import logging.handlers
import sys
#TODO: Replace hashlib.md5sum with hash() - see @shmee's answer
import hashlib


repetitive_messages_cached_md5sum_dict = {}


# Once the STEP level is implemented, logging.STEP will print
# +-------------------------------------+
# |    TEST STEP 1. - MESSAGE TEXT      |    
# +-------------------------------------+

# Custom severity STEP with value 15, which is below INFO, because this is a kind of INFO
# but you want to format it differently
# logging module is extended with custom severity and log function step()
STEP = 15
logging.STEP = STEP
logging.addLevelName(STEP, 'STEP')

def step(self, msg:str, *args, **kwargs) -> None:
    """
    Log 'msg % args' with severity 'STEP'.
    To pass exception information, use the keyword argument exc_info with a true value (TODO: Clarify this comment)
    Examples:
        logger.step('Login into system')
        logger.step(('Login with u:{user} p:{password}').format(user='user01', password='1234'))

    :param msg: The message to log
    :param args: Additional arguments to pass
    :param kwargs: Additional kwargs pairs to pass
    """
    if self.isEnabledFor(logging.STEP):
        self._log(logging.STEP, msg, args, **kwargs)

logging.Logger.step = step

# Custom severity UNIQUE with value 21, which is right above INFO, because this is a kind of INFO
# but you want to format it differently
# logging module is extended with custom severity and log function unique()
UNIQUE = 21
logging.UNIQUE = UNIQUE
logging.addLevelName(UNIQUE, 'UNIQUE')


def unique(self, msg_name, msg, level=logging.INFO, *args, **kwargs):
    """
    Log only messages that are different from previous messages. The messages are cached and keyed by the
    message name.
    Log 'msg % args' with severity 'UNIQUE'.
    To pass exception information, use the keyword argument exc_info with a true value (TODO: Clarify this comment)
    Examples:
        logger.unique('login_msg', 'Login into system')
        logger.unique('login_user', ('Login with u:{user} p:{password}').format(user='user001', password='1234'))

    :param msg_name: The name of the repetitive message to log
    :param msg: The message to log
    :param level: default=INFO. The effective level of the message
    :param args: Pass args to log function. See log function documentation for details
    :param kwargs: Pass kwargs to log function. See log function documentation for details
    :return: Boolean logged_msg: Was the message logged or not?
    """

    logged_msg = False
    if self.isEnabledFor(logging.UNIQUE):
        # md5 operates on bytes, hence the encode()
        log_msg_md5sum = hashlib.md5(msg.encode()).hexdigest()
        msg_name = f'{self.name}-{msg_name}'
        if msg_name in repetitive_messages_cached_md5sum_dict:
            if log_msg_md5sum == repetitive_messages_cached_md5sum_dict[msg_name]:
                return logged_msg

        self._log(level, f'{msg_name}: {msg}', args, **kwargs)
        repetitive_messages_cached_md5sum_dict[msg_name] = log_msg_md5sum
        logged_msg = True

    return logged_msg


logging.Logger.unique = unique


class LessThanFilter(logging.Filter):
    """
    Documentation: https://stackoverflow.com/a/31459386/317460
    Filter to log logs between two log-levels
    """

    def __init__(self, exclusive_maximum, name="") -> None:
        super(LessThanFilter, self).__init__(name)
        self.max_level = exclusive_maximum

    def filter(self, record) -> bool:
        # a True return means we log this message
        return record.levelno < self.max_level


class MyOwnLogger(logging.getLoggerClass()):
    __FORMAT = {
        'fmt': '%(asctime)s [%(levelname)s] - func=%(funcName)s:[%(lineno)s] - location=%(filename)s - MSG= %(message)s',
        'datefmt': '%Y-%m-%d %H:%M:%S',
        'style': '%'
    }

    def __init__(self, format=__FORMAT, level=logging.STEP) -> None:
        formatter = logging.Formatter(**format)

        # reconfigure the root logger: drop existing handlers, set the level
        self.root.handlers = []
        self.root.setLevel(level)

        # in stdout print from level and up - in this case minimum is STEP
        streamHandler_out = logging.StreamHandler(sys.stdout)
        streamHandler_out.setLevel(level=level)
        streamHandler_out.setFormatter(formatter)
        streamHandler_out.addFilter(LessThanFilter(logging.WARNING))

        streamHandler_err = logging.StreamHandler(sys.stderr)
        streamHandler_err.setLevel(level=logging.WARNING)
        streamHandler_err.setFormatter(formatter)

        self.root.addHandler(streamHandler_out)
        self.root.addHandler(streamHandler_err)



def string_md5sum(text):
    # md5 operates on bytes, hence the encode()
    return hashlib.md5(text.encode()).hexdigest()
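
A minimal usage sketch (assuming the code above is saved as, say, my_logging.py; the module name and the task loop are illustrative):

import logging
import my_logging  # hypothetical module name for the code above

my_logging.MyOwnLogger()  # configures the root logger's handlers as a side effect

logger = logging.getLogger('tasks')

for status in ('in progress', 'in progress', 'in progress', 'complete'):
    logger.unique('task_x', f'Task X is {status}')

# Only two lines are emitted, along the lines of:
# 2019-08-20 12:00:00 [INFO] - func=unique:[...] - location=my_logging.py - MSG= tasks-task_x: Task X is in progress
# 2019-08-20 12:00:30 [INFO] - func=unique:[...] - location=my_logging.py - MSG= tasks-task_x: Task X is complete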
RaamEE
  • Few questions :) Why do you define a specific level for this functionality if you still build the log records with the standard levels? What is the `msg_name` parameter to `unique` good for? `logger.unique('a', "Task X in progress")` and `logger.unique('b', "Task X in progress")` would both log from what I see, despite the messages being equal. I'd expect both examples from the docstring in `unique` to raise a TypeError for missing a required positional argument (`msg`). What is `logging.STEP` that you use as the default level in `MyOwnLogger`'s initializer? What is `MyOwnLogger` intended for? – shmee Aug 13 '19 at 09:36
  • Thanks @shmee for catching all these points in my code. I tried to post the minimum, but missed cleaning STEP, so I am adding it back. 1) msg_name is used to track the md5sum of multiple messages; in the example there are task_x and task_y. 2) Corrected the docstring. 3) STEP is a level I intend to use for printing a message surrounded by a text frame; this is not implemented yet and is useful for tests. I added an example comment. 4) MyOwnLogger is imported when I start my code, to configure logging and add the levels STEP and UNIQUE. – RaamEE Aug 13 '19 at 13:38
  • Ok, let me rephrase :) Do you need the possibility to have the very same message more than once in your message cache using different keys? – shmee Aug 13 '19 at 15:02
  • Yes. It is not the identical message: for key 'task_x' the message is 'Task X is in progress', while for key 'task_y' the message is 'Task Y is in progress'. I am tracking the changes in the messages for task X separately from task Y. – RaamEE Aug 18 '19 at 12:47
  • Yeah, I noticed that you don't cache the same message twice in your example. But with your approach you *could* do it. I was wondering if that was deliberate. Otherwise, if you don't need to easily access that message in your cache, you could just use a list, or a set. A dict seems a bit over the top in that case. Not having to pass a key to your function would open up some more options to implement your requirement. – shmee Aug 18 '19 at 16:38
  • I modified the function to set the message name as f'{self.name}-{msg_name}'. This allows caching the same message if it is logged by different logger names, so the same unique message from logger name 'A' is cached separately from the message logged by logger name 'B'. This saves having to manage different message names for different instances, by assigning a logger name to the logger instead. – RaamEE Aug 20 '19 at 18:53