
I use Python to chain multiple processing tools for NLP tasks together, capturing the output of each in case something fails and writing it to a log.

Some tools run for many hours and report their current status as a progress percentage using carriage returns (\r). They perform many steps, so normal messages and progress messages get mixed. The result is sometimes really large log files that are hard to view with less. When progress is fast, my log looks like this:

[DEBUG  ] [FILE] [OUT] ^M4% done^M8% done^M12% done^M15% done^M19% done^M23% done^M27% done^M31% done^M35% done^M38% done^M42% done^M46% done^M50% done^M54% done^M58% done^M62% done^M65% done^M69% done^M73% done^M77% done^M81% done^M85% done^M88% done^M92% done^M96% done^M100% doneFinished

What I want is an easy way to collapse those progress strings in Python. (I guess it is also possible to do this after the pipeline has finished and replace progress messages with e.g. sed ...)

My code for running and capturing the output looks like this:

import os
import subprocess
from tempfile import NamedTemporaryFile

def run_command_of(command):
    try:
        out_file = NamedTemporaryFile(mode='w+b', delete=False, suffix='out')
        err_file = NamedTemporaryFile(mode='w+b', delete=False, suffix='err')
        debug('Redirecting command output to temp files ...',
              'out =', out_file.name, ', err =', err_file.name)

        p = subprocess.Popen(command, shell=True,
                             stdout=out_file, stderr=err_file)
        p.communicate()
        status = p.returncode

        def fr_gen(file):
            debug('Reading from %s ...' % file.name)
            file.seek(0)
            for line in file:
                # decode bytes; replace undecodable sequences instead of
                # raising UnicodeDecodeError
                yield line.decode('utf-8', errors='replace').rstrip()
            debug('Closing temp file %s' % file.name)
            file.close()
            os.unlink(file.name)
        return (fr_gen(out_file), fr_gen(err_file), status)
    except Exception:
        from sys import exc_info
        error('Error while running command', command, exc_info()[0], exc_info()[1])
        return (None, None, 1)

def execute(command, check_retcode_null=False):
    debug('run command:', command)
    out, err, status = run_command_of(command)
    debug('-> exit status:', status)

    if out is not None:
        is_empty = True
        for line in out:
            is_empty = False
            debug('[FILE]', '[OUT]', line.encode('utf-8', errors='replace'))
        if is_empty:
            debug('execute: no output')
    else:
        debug('execute: no output?')
    if err is not None:
        is_empty = True
        for line in err:
            is_empty = False
            debug('[FILE]', '[ERR]', line.encode('utf-8', errors='replace'))
        if is_empty:
            debug('execute: no error-output')
    else:
        debug('execute: no error-output?')

    if check_retcode_null:
        return status == 0
    return True

It is some older Python 2 code (a fun time with unicode strings) that I want to rewrite in Python 3 and improve upon. (I'm also open to suggestions on how to process the output in real time rather than when everything is finished. Update: that is too broad and not exactly part of my problem.)
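For the Python 3 rewrite, a minimal sketch of the capture logic could look like this. It is not my final implementation; it assumes `subprocess.run` with `capture_output=True` (Python 3.7+) is acceptable in place of the temp-file redirection, and that `debug`/`error` are plain print-style wrappers:

```python
import subprocess

def run_command_of(command):
    """Run a shell command, capturing stdout/stderr as text (Python 3 sketch)."""
    try:
        # text=True decodes the streams; errors='replace' avoids
        # UnicodeDecodeError on malformed tool output
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, errors='replace')
        out = result.stdout.splitlines()
        err = result.stderr.splitlines()
        return out, err, result.returncode
    except OSError as e:
        print('Error while running command', command, e)
        return None, None, 1

print(run_command_of('echo hello'))
```

Note that `capture_output=True` buffers everything in memory, so for tools with very large output the temp-file approach above may still be preferable.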

I can think of many approaches, but I do not know whether there is a ready-to-use function/library for this, and I could not find any. (My google-fu needs work.) The only things I found were ways to remove the CR/LF, but not the string portion that gets visually overwritten. So I'm open to suggestions and improvements before I invest my time in reinventing the wheel. ;-)

My approach would be to use a regex to find the sections between \r characters and remove them. Optionally I would keep a single percentage value for really long processes. Something like [^\r\n]*\r.
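As a rough sketch of that regex idea (an assumption on my part, not tested against all my tools): every segment terminated by a `\r` is visually overwritten, so dropping those segments keeps only what would remain on screen. Note this ignores the case where an earlier segment was longer than the last one:

```python
import re

def collapse_cr(line):
    # Drop every run of non-CR/LF characters that ends in a carriage
    # return; only the final segment of each line survives.
    return re.sub(r'[^\r\n]*\r', '', line)

print(collapse_cr('\r4% done\r8% done\r100% doneFinished'))
# -> 100% doneFinished
```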


Note: A possible duplicate of: How to pull the output of the most recent terminal command? It may require a wrapper script, but it can still be used to convert my old log files with script2log. In the end I found/got a suggestion for a plain-Python way that fulfills my needs.

E. Körner
  • This seems very broad and at the same time I'm thinking maybe you are simply looking for `variable.split('\r')`. Maybe see also https://stackoverflow.com/a/51950538/874188 – tripleee Jan 26 '19 at 14:18
  • The link was really informative. But I currently have a way to capture the output; what I want is the final text in my log, as I would see it in the console. That means I don't want the overwritten text and CRs. I guess splitting on CR and only using the last of the splits will do. But I had hoped there was an _even easier_ way to do this. – E. Körner Jan 26 '19 at 14:22
  • The main challenge is probably buffering. Many OSes will line-buffer output and CR usually does not work as a line terminator. See if you can force the subprocess to be unbuffered. Maybe see also [the Stack Overflow `subprocess` tag info page.](/tags/subprocess/info) – tripleee Jan 26 '19 at 14:30
  • Possible duplicate of [How to pull the output of the most recent terminal command?](https://stackoverflow.com/questions/49013526/how-to-pull-the-output-of-the-most-recent-terminal-command) – Thomas Dickey Jan 26 '19 at 15:39
  • @Thomas Dickey I have to check this out before I will mark it as duplicate but it seems like it would be the solution to my problem. I may have to wrap the tools and params but it may work (even if not plain python). `script2log` looks promising. I still have doubt here: `sed` with `s/^M^M*$//g s/^.*^M//g` if the previous line is longer than the newer overwriting one, `sed` may delete more than what would be visible in a console? But it seems more like an edge case. (Those two expressions can also easily be used in python) – E. Körner Jan 26 '19 at 18:42
  • @tripleee thanks for the subprocess Stack Overflow page. I did not know/remember it, but it will really help me. (Easier than googling for unknown/hard-to-describe topics.) – E. Körner Jan 26 '19 at 18:46

1 Answer


I think the solution for my use case is as simple as this snippet:

# my data
segments = ['abcdef', '567', '1234', 'xy', '\n']
s = '\r'.join(segments)

# really fast approach: keep only the last segment (what the cursor
# would end on), ignoring partial overwrites
last = s.rstrip().split('\r')[-1]

# or: simulate the console overwrites (the tail of an earlier, longer
# segment stays visible behind a shorter later one)
parts = s.rstrip().split('\r')
last = parts[-1]
last_len = len(last)
for part in reversed(parts):
    if len(part) > last_len:
        # keep the still-visible tail of the longer, earlier segment
        last = last + part[last_len:]
        last_len = len(last)

# result
print(last)

Thanks to the comments on my question, I could refine my requirements further. In my case the only control characters are carriage returns (CR, \r), and a rather simple solution works, as tripleee suggested.

Why not simply take the last part after \r? The output of

echo -e "abcd\r12"

can result in:

12cd

The questions under the subprocess tag (also suggested in a comment by tripleee) should help with realtime/interleaved output, but that is outside my current focus. I will have to test the best approach. I was already using stdbuf to switch the buffering when needed.
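For completeness, a hypothetical sketch of realtime reading (not part of my final solution): read the pipe byte by byte and treat both `\r` and `\n` as line terminators, so progress updates can be logged or skipped immediately. Note the child may still block-buffer when writing to a pipe, which is where `stdbuf -o0` comes in:

```python
import subprocess

def stream_lines(command):
    """Yield output segments of a shell command, split on '\r' and '\n'."""
    p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
    buf = bytearray()
    while True:
        b = p.stdout.read(1)          # one byte at a time on our side
        if not b:
            break
        if b in (b'\r', b'\n'):
            if buf:
                yield buf.decode('utf-8', errors='replace')
                buf.clear()
        else:
            buf += b
    p.wait()
    if buf:
        yield buf.decode('utf-8', errors='replace')

for line in stream_lines(r"printf 'a\rb\nc\n'"):
    print(line)
```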

E. Körner