Parse a variable row in Python

Question

This follows up on an earlier question: Splitlines in Python a table with empty spaces

The approach from that question works well, but there is a problem when the column widths change:

COMMAND     PID       USER   FD      TYPE DEVICE  SIZE/OFF   NODE NAME
init          1       root  cwd   unknown                         /proc/1/cwd (readlink: Permission denied)
init          1       root  rtd   unknown                         /proc/1/root

The problem starts at the DEVICE or SIZE/OFF column, but in other situations it could affect any column.

COMMAND     PID       USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
init          1       root  cwd       DIR                8,1      4096          2 /
init          1       root  rtd       DIR                8,1      4096          2 /
init          1       root  txt       REG                8,1     36992     139325 /sbin/init
init          1       root  mem       REG                8,1     14696     190970 /lib/libdl-2.11.3.so
init          1       root  mem       REG                8,1   1437064     190958 /lib/libc-2.11.3.so
python    30077     carlos    1u      CHR                1,3       0t0        700 /dev/null

The header row always follows the same pattern: the first column starts at the C of COMMAND, the second ends at the D of PID, the fourth ends one character after the D of FD, and so on. Is there a way to count the spaces in the header row and use them to fill in this code, so that it can parse the other rows?

import struct

# note: variable-length NAME field at the end intentionally omitted
base_format = '8s 1x 6s 1x 10s 1x 4s 1x 9s 1x 6s 1x 9s 1x 6s 1x'
base_format_size = struct.calcsize(base_format)

Any ideas how to solve the problem?
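
For illustration, here is one way the header-counting idea could look. This is a minimal sketch, not a tested lsof parser. It assumes that every column except the last is right-aligned under its header word, so the right edge of each header word marks the end of that column. The helper names `column_slices` and `parse_row`, and the `extra_right` parameter, are hypothetical. The `{"FD": 1}` adjustment encodes the observation above that FD values end one character after the D of FD. The sample listing is generated with `str.format` so that its alignment is exact; its values mirror the rows quoted above.

```python
import re


def column_slices(header, extra_right=None):
    """Return {column name: slice}, using the right edge of each header word.

    Assumption: every column except the last is right-aligned under its
    header, so a value never extends past the end of the header word.
    extra_right (hypothetical parameter) widens named columns to the right.
    """
    extra_right = extra_right or {}
    words = list(re.finditer(r"\S+", header))
    slices = {}
    start = 0
    for word in words[:-1]:
        end = word.end() + extra_right.get(word.group(), 0)
        slices[word.group()] = slice(start, end)
        start = end
    # The last column (NAME) is free-form: take everything from its header on.
    slices[words[-1].group()] = slice(words[-1].start(), None)
    return slices


def parse_row(line, slices):
    """Cut one data row with the slices derived from the header row."""
    return {name: line[span].strip() for name, span in slices.items()}


# Build a small aligned listing so the sketch runs without lsof.
ROW = "{:<7} {:>5} {:>6} {:>5} {:>7} {:>6} {:>8} {:>6} {}"
header = ROW.format("COMMAND", "PID", "USER", "FD ", "TYPE",
                    "DEVICE", "SIZE/OFF", "NODE", "NAME")
rows = [
    ROW.format("init", "1", "root", "cwd ", "unknown", "", "", "",
               "/proc/1/cwd (readlink: Permission denied)"),
    ROW.format("python", "30077", "carlos", "1u", "CHR", "1,3", "0t0", "700",
               "/dev/null"),
]

slices = column_slices(header, extra_right={"FD": 1})
parsed = [parse_row(row, slices) for row in rows]
for record in parsed:
    print(record)
```

Because the slices are recomputed from each listing's own header, empty DEVICE, SIZE/OFF and NODE cells come back as empty strings instead of shifting the later fields. The sketch would still break for a left-aligned value that is wider than its header word, such as a COMMAND longer than seven characters.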

We could solve this pretty easily with a regex or split if there was a quick way to get lsof to put a character representing no data where it would put whitespace otherwise. — Tim Wilder, Nov 30 '13 at 18:32

score 2 · Accepted Answer · answered Nov 30 '13 at 19:14

I did a bit of reading on lsof -F after checking out the other thread and found that it produces easily parsed output. Here's a quick demonstration of the general idea. It parses that output and prints the first 20 parsed rows to show the format. Are you able to use -F for your use case?

import pprint
import subprocess


def get_rows(output_to_parse, whitelist_keys):
    # Reverse the lines so that lines.pop() consumes them in their original order.
    lines = output_to_parse.split("\n")[::-1]
    rows = []
    while lines:
        row = _get_new_row(lines, whitelist_keys)
        if row:  # trailing blank lines would otherwise add an empty row
            rows.append(row)
    return rows


def _get_new_row(lines, whitelist_keys):
    # Collect fields until a field id repeats; the repeat starts the next row.
    output = {}
    while lines:
        line = lines.pop()
        if line == '':
            continue
        key = line[0]
        if key not in whitelist_keys:
            raise ValueError(key)
        if key in output:
            # Put the line back so the next row starts with it.
            lines.append(line)
            break
        output[key] = line[1:]
    return output

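if __name__ == "__main__":
    # The list of field ids is read from stderr (identifiers[1]).
    # universal_newlines=True makes the pipes return text instead of bytes.
    identifiers = subprocess.Popen(["lsof", "-F", "?"], stderr=subprocess.PIPE,
                                   universal_newlines=True).communicate()

    # Skip the first line of the listing and keep the first character of each remaining line.
    keys = set([line.strip()[0] for line in identifiers[1].split("\n") if line.strip()][1:])

    lsof_output = subprocess.check_output(["lsof", "-F"], universal_newlines=True)
    rows = get_rows(lsof_output, whitelist_keys=keys)
    pprint.pprint(rows[:20])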
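
To see what the one-field-per-line format looks like without running lsof, here is a small sketch that groups fields on a hand-written sample. It uses a different grouping rule from the code above: instead of starting a new row when a field id repeats, it treats a line starting with p as the start of a process set and a line starting with f as the start of a file set. The sample text and the name `iter_file_records` are illustrative assumptions, not real lsof output.

```python
# Hand-written sample in the one-field-per-line `lsof -F` style (illustrative only).
SAMPLE = "\n".join([
    "p1", "cinit", "Lroot",
    "fcwd", "tDIR", "n/",
    "frtd", "tDIR", "n/",
    "p30077", "cpython", "Lcarlos",
    "f1", "tCHR", "n/dev/null",
])


def iter_file_records(text):
    """Yield one dict per open file, merged with the fields of its process."""
    process = {}
    current = None
    for line in text.splitlines():
        if not line:
            continue
        key, value = line[0], line[1:]
        if key in "pf" and current is not None:
            yield current  # a new process or file set closes the previous file
            current = None
        if key == "p":
            process = {key: value}
        elif key == "f":
            current = dict(process, f=value)
        elif current is None:
            process[key] = value  # still inside the process fields
        else:
            current[key] = value
    if current is not None:
        yield current


records = list(iter_file_records(SAMPLE))
for record in records:
    print(record)
```

Each record carries both the process fields (p, c, L) and the file fields (f, t, n). A process with no file sets produces no record in this sketch.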

+1 for the `-F` idea. `lsof` may be slow. It might make sense to [parse its output incrementally](http://stackoverflow.com/a/20386567/4279) — jfs, Dec 04 '13 at 21:42

score 1 · Answer 2 · edited May 23 '17 at 12:25

As @Tim Wilder said, you could use lsof -F to get machine-readable output. Here's a script that converts lsof output into JSON, one JSON object per line. It produces output as soon as the pipe buffers are full, without waiting for the whole lsof process to end (which takes a while on my system):

#!/usr/bin/env python
import json
import locale
from collections import OrderedDict
from subprocess import Popen, PIPE

encoding = locale.getpreferredencoding(True) # for json

# define fields we are interested in, see `lsof -F?`
# http://www.opensource.apple.com/source/lsof/lsof-12/lsof/lsof_fields.h
ids = 'cpLftsn'
headers = 'COMMAND PID USER FD TYPE SIZE NAME'.split() # see lsof(8)
names = OrderedDict(zip(ids, headers)) # preserve order

# use '\0' byte as a field terminator and '\n' to separate each process/file set
p = Popen(["lsof", "-F{}0".format(''.join(names))], stdout=PIPE, bufsize=-1)
for line in p.stdout: # each line is a process or a file set
    #  id -> field
    fields = {f[:1].decode('ascii', 'strict'): f[1:].decode(encoding)
              for f in line.split(b'\0') if f.rstrip(b'\n')}
    if 'p' in fields: # process set
        process_info = fields # start new process set
    elif 'f' in fields: # file set
        fields.update(process_info) # add process info (they don't intersect)
        result = OrderedDict((name, fields.get(id))
                             for id, name in names.items())
        print(json.dumps(result)) # one object per line
    else:
        assert 0, 'either p or f ids should be present in a set'
p.communicate() # close stream, wait for the child to exit

The field names such as COMMAND are described in the lsof(8) man page. To get the full list of available field ids, run lsof -F? or see lsof_fields.h.

Fields that are not available are set to null. You could omit them instead. I've used OrderedDict to preserve order from run to run. The sort_keys parameter of json.dumps could be used instead. To pretty-print, you could use the indent parameter.
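
A short sketch of those options, using a made-up record rather than real lsof output:

```python
import json
from collections import OrderedDict

# Made-up record; SIZE is missing, as it can be for some files.
record = OrderedDict([("COMMAND", "init"), ("PID", "1"),
                      ("SIZE", None), ("NAME", "/")])

# Missing fields are serialized as null.
with_null = json.dumps(record)

# Alternative: drop missing fields before serializing.
without_missing = json.dumps(
    OrderedDict((key, value) for key, value in record.items()
                if value is not None))

# Alternative to OrderedDict: let json.dumps sort the keys.
sorted_keys_output = json.dumps(record, sort_keys=True)

# Pretty-printed, one key per line.
pretty = json.dumps(record, indent=2)

print(with_null)
print(without_missing)
print(sorted_keys_output)
print(pretty)
```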

lsof converts characters that are non-printable in the current locale using a special encoding, which makes some values ambiguous.

This helped me a lot. Thank you. It returns all data as if I just call lsof in contrast to another answer which returns much less data. — wolfroma, Sep 08 '16 at 18:44
