Filter out non-numeric lines from multiple columns in pandas

Question

I have fairly big LZMA-compressed data files that I want to read using pandas to extract the minimum and maximum value of certain columns. The file was generated using grep -n from a log file of a program running under MPI, and thus contains garbled lines where multiple MPI ranks write out to stdout at the same time.

This question is very similar to this one but I need to do the same thing three times for each column. I've tried the various presented answers there to no avail.

Here is the Python script I've got so far:

import os # to check file existence
import sys # for argc, argv
import re # regex
import lzma as xz
import numpy as np
import pandas as pd

# Quick exit if file does not exist
if not os.path.exists(sys.argv[1]):
    # sys.exit accepts a single argument, so build one message string
    sys.exit( 'Error: cannot read file ' + sys.argv[1] );
else:
    
    # Define column names and columns to take
    cols     = [ 4,   7,   10  ];
    colnames = [ 'm', 'n', 'k' ];

    # Read file through LZMA decompressor
    ifname = sys.argv[1];
    ifile = xz.open( ifname, 'rt' );
    data = pd.read_csv( ifile, delim_whitespace=True, \
                        usecols=cols, names=colnames, \
                        error_bad_lines=False );
    ifile.close();

    ### Insert filtering method here to transform data to data_clean
    
    mdims = data_clean['m'].to_numpy();
    mmin = np.amin(mdims);
    mmax = np.amax(mdims);
    ndims = data_clean['n'].to_numpy();
    nmin = np.amin(ndims);
    nmax = np.amax(ndims);
    kdims = data_clean['k'].to_numpy();
    kmin = np.amin(kdims);
    kmax = np.amax(kdims);

    # Display output
    # re.sub takes (pattern, replacement, string): strip the '.xz' suffix
    print( re.sub( r'\.xz$', '', ifname ), ':' );
    print( 'M =', mmin, '-', mmax );
    print( 'N =', nmin, '-', nmax );
    print( 'K =', kmin, '-', kmax );

    sys.exit(0);

Here are two data files that you can test with. Any help would be appreciated.

Serge Ballesta · Accepted Answer · 2021-03-24T08:32:14.797

When it comes to data filtering, the sooner the better.

Here I would use a converter to replace offending values with NaN at load time. That way, the filtering will only require dropna:

def convert(x):
    try:
        return np.int64(x)
    except ValueError:
        return np.nan
...
data = pd.read_csv( ifile, delim_whitespace=True, \
                    usecols=cols, names=colnames, \
                    error_bad_lines=False, \
                    converters= {k: convert for k in colnames})
data_clean = data.dropna().astype('int64')
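
To illustrate only the dropna step (not read_csv itself), here is a minimal sketch that applies the same convert function to a small hand-made frame of strings. The sample values and the names raw and converted are made up for the demonstration; they are not taken from the data files.

```python
import numpy as np
import pandas as pd

def convert(x):
    """Return x as an integer, or NaN when x is not a valid integer literal."""
    try:
        return np.int64(x)
    except ValueError:
        return np.nan

# Hypothetical raw fields, as they might look after whitespace splitting;
# 'into' and '(' stand in for fragments of garbled lines.
raw = pd.DataFrame({
    'm': ['51', 'into', '51'],
    'n': ['449', '449', '('],
    'k': ['2408', '2391', '2408'],
})

# Convert every cell, then drop each row that holds at least one NaN.
converted = raw.apply(lambda col: col.map(convert))
data_clean = converted.dropna().astype('int64')

print(data_clean)
```

By default dropna removes a row as soon as any of its columns is NaN, so only the first row survives here.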

But in fact, reaching for a CSV reader is already too late, because this is not a true CSV file. It contains lines like:

793883: zgemm: m =           51  n =           449 k =          2408
793884: zgemm: m =           51  n =           449 k =          2408
793885: zgemm: m =           51  n =           449 k =          2408
793886: zgemm: m =           51  n =           449 k =          2408
793887: zgemm: m =           51  n =           449 k =          2408
793888: zgemm: m =           51  n =           449 k =          2408

So far so good. The problem is that it also contains garbled lines like:

3251002: ) into (     zgemm: m =           51  n =           449 k =          2391
1735619: zgemm: m =           51  n =           449 k =          24043 x          243 
1747325: zgemm: m =           51  n =           449 k =          239          3 packing gntuju (          243 x          243

The last two lines show that trying to salvage erroneous lines could lead to wrong data, because some values can be truncated or concatenated with other numbers.

But a regex should be enough to identify valid lines. So I would do:

...
import re
...

    ...
    pattern = r'\d+:\s\w+:\s+m\s+=\s+(\d+)\s+n\s+=\s+(\d+)\s+k\s+=\s+(\d+)\s*$'
    rx = re.compile(pattern)
    data = pd.DataFrame((m.groups() for line in ifile
                       for m in (rx.match(line),) if m),
                      columns=colnames).astype('int64')
    ...
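
As a quick check of that pattern, the sketch below runs it over the sample lines quoted above. To stay dependency-free it collects plain tuples instead of building a DataFrame. The last sample line, with slightly different values, is invented here so that the minimum and maximum differ.

```python
import io
import re

# Same pattern as above: only fully well-formed lines match.
pattern = r'\d+:\s\w+:\s+m\s+=\s+(\d+)\s+n\s+=\s+(\d+)\s+k\s+=\s+(\d+)\s*$'
rx = re.compile(pattern)

# One valid line, the three garbled lines, and one made-up valid line.
sample = io.StringIO(
    "793883: zgemm: m =           51  n =           449 k =          2408\n"
    "3251002: ) into (     zgemm: m =           51  n =           449 k =          2391\n"
    "1735619: zgemm: m =           51  n =           449 k =          24043 x          243 \n"
    "1747325: zgemm: m =           51  n =           449 k =          239          3 packing gntuju (          243 x          243\n"
    "793884: zgemm: m =           52  n =           450 k =          2400\n"
)

# Keep the captured (m, n, k) values of matching lines only.
rows = [tuple(int(v) for v in m.groups())
        for m in map(rx.match, sample) if m]

# Per-column ranges, in the order m, n, k.
m_vals, n_vals, k_vals = zip(*rows)
k_range = (min(k_vals), max(k_vals))

print(rows)
print('K =', k_range[0], '-', k_range[1])
```

All three garbled lines are rejected: the first because the text after the line number is not a plain word followed by a colon, the other two because something other than whitespace follows the k value.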

I think this is a step in the right direction, but I'm getting weird results... For the first file, the script outputs K = 239 - 24043 (it should be both ~2400) and for the second file, the script outputs N = 0 - 85 and K = 0 - 7048 (N should be ~80 and K ~7000). One thing for sure though: the garbled lines contain a different number of columns than the rest of the data. How do I drop the entire row when one of the columns in that row contains a NaN? — wyphan, Mar 23 '21 at 20:15
@wyphan *The sooner the better*. IMHO the most reliable way is to validate the lines with a regex. Fortunately, the regex can also capture the relevant data to feed a dataframe with. See my edit. — Serge Ballesta, Mar 24 '21 at 08:34
That works perfectly, thanks! Marking your answer as accepted. — wyphan, Mar 25 '21 at 00:31
