I have fairly big LZMA-compressed data files that I want to read using pandas to extract the minimum and maximum values of certain columns. Each file was generated using grep -n from the log file of a program running under MPI, and thus contains garbled lines where multiple MPI ranks wrote to stdout at the same time.
This question is very similar to this one, but I need to do the same thing three times, once for each column. I've tried the various answers presented there, to no avail.
Here is the Python script I've got so far:
    import os   # to check file existence
    import sys  # for sys.argv and sys.exit
    import re   # regex
    import lzma as xz
    import numpy as np
    import pandas as pd

    # Quick exit if file does not exist
    if not os.path.exists(sys.argv[1]):
        sys.exit('Error: cannot read file ' + sys.argv[1])

    # Define column names and columns to take
    cols = [4, 7, 10]
    colnames = ['m', 'n', 'k']

    # Read file through LZMA decompressor
    ifname = sys.argv[1]
    with xz.open(ifname, 'rt') as ifile:
        # sep=r'\s+' is equivalent to delim_whitespace=True;
        # on_bad_lines='skip' replaces error_bad_lines=False (pandas >= 1.3)
        data = pd.read_csv(ifile, sep=r'\s+',
                           usecols=cols, names=colnames,
                           on_bad_lines='skip')

    ### Insert filtering method here to transform data to data_clean

    # Minimum and maximum of each column
    mdims = data_clean['m'].to_numpy()
    mmin = np.amin(mdims)
    mmax = np.amax(mdims)
    ndims = data_clean['n'].to_numpy()
    nmin = np.amin(ndims)
    nmax = np.amax(ndims)
    kdims = data_clean['k'].to_numpy()
    kmin = np.amin(kdims)
    kmax = np.amax(kdims)

    # Display output (file name with the .xz suffix stripped)
    print(re.sub(r'\.xz$', '', ifname), ':')
    print('M =', mmin, '-', mmax)
    print('N =', nmin, '-', nmax)
    print('K =', kmin, '-', kmax)
    sys.exit(0)
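To make the missing filtering step concrete, here is a rough sketch of the kind of thing I have in mind. It assumes that garbled lines show up as non-numeric entries in at least one of the three columns, which may not hold for every garbled line. The helper name clean_numeric and the sample values are made up for illustration; they are not taken from my real files.

```python
import pandas as pd

def clean_numeric(data):
    """Hypothetical helper: keep only rows where every column parses as a number."""
    # Entries that cannot be parsed as numbers become NaN instead of raising
    coerced = data.apply(pd.to_numeric, errors='coerce')
    # Drop every row in which at least one column failed to parse
    return coerced.dropna()

# Made-up sample standing in for the frame returned by read_csv
sample = pd.DataFrame({'m': ['128', '25m=6', '64', '256'],
                       'n': ['32', '16', '8', '512'],
                       'k': ['4', '2', 'k=x', '16']})
data_clean = clean_numeric(sample)

# A single call gives the minimum and maximum of all three columns
summary = data_clean.agg(['min', 'max'])
print(summary)
```

Columns that contained a bad entry come back as floats after the coercion, so an astype(int) may be wanted before printing. A garbled line whose fields still happen to parse as numbers would slip through this filter.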
Here are two data files that you can test with. Any help would be appreciated.