Read sparse matrix from ascii file into python

Question

I have a Fortran code using the parallel PETSc sparse matrix format mpiaij.

I want to do some analysis of these matrices, so I want to read them into Python.

I tried the binary output in Fortran and the binary input from petsc4py, but apparently they are not compatible. The PETSc HDF5 output creates HDF5 files that I cannot read, so I am stuck for now with the ASCII format.

In ASCII the matrices look like:

Mat Object: 48 MPI processes
  type: mpiaij
row 0: (0, 0.934865)  (1, 0.00582401)  (2, -0.00125881)  (3, 0.000157352)  (10, 0.0212704)  (11, -9.37151e-05)  (12, 7.77296e-06)  (13, 1.15276e-06)  (20, -0.00457321)  (21, 9.31045e-06)  (22, -1.37541e-07)  (23, -3.00994e-07)  (30, 0.000571716)  (31, 5.82622e-07)  (32, -2.27908e-07)  (33, 4.55904e-08)  (3410, 0.0005718)  (3411, 3.14914e-06)  (3412, -5.83246e-07)  (3413, 5.58045e-08)  (3420, -0.00457491)  (3421, -3.91645e-05)  (3422, 6.62677e-06)  (3423, -5.10165e-07)  (3430, 0.0212818)  (3431, 0.000230778)  (3432, -3.75686e-05)  (3433, 2.57173e-06) 
row 1: (...)

Is there an elegant way to parse this into python?

score 0 · Answer 1 · answered Aug 10 '19 at 13:36

I'm not familiar with PETSc or its matrix formats, but given the example ASCII format it's certainly possible to convert this to any other matrix format in Python. I assume that the file contains one line for each non-zero row, and that the number pairs in each row are the column index and the corresponding value. Is that correct?

What you consider "an elegant way" is a personal opinion and not really a valid question for Stack Overflow, but I can try to point you in the right direction of a working solution.

First of all, without knowing all the details, it seems to me that the right question would be "why are the binary output in Fortran and the binary input in petsc4py not compatible?" If you can solve that, it will probably be the best solution. If I remember correctly, Fortran compilers can write binary data in either byte order, so the file might be big-endian, while Python on typical x86 hardware reads in the platform's native little-endian order unless told otherwise. Maybe you can specify the byte order in one of the library functions, or you could convert the byte order manually if necessary. This is something you might want to look into first.
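
To illustrate what a byte-order mismatch looks like, here is a minimal sketch using only the standard `struct` module. It is not specific to PETSc or petsc4py; it just shows how the same four bytes yield different integers depending on the byte order assumed when reading them.

```python
import struct

# Pack the 32-bit integer 1 in big-endian byte order.
raw = struct.pack('>i', 1)

# Reading with the matching byte order recovers the value.
as_big = struct.unpack('>i', raw)[0]

# Reading the same bytes as little-endian gives a very different number.
as_little = struct.unpack('<i', raw)[0]

print(as_big, as_little)
```

If the sizes or indices read from a binary file look absurdly large, a mismatch of this kind is one possible cause worth ruling out.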

As a workaround, you could parse the ASCII format in Python for further processing. I assume you have already searched for existing libraries and could not find any, so you need to write some custom code. Depending on your needs, a "nice" solution would use regular expressions, but a quick-and-dirty way is to use standard string methods and the eval() function, since the ASCII format already closely resembles Python syntax :-)

NOTE: Only use the eval() function if you trust the input file, since it is vulnerable to code injection attacks! For personal use, this is normally not a problem.

I've provided some example code below. This does the basic input processing. What you want to do with the data is up to you, so you'll need to finish the code yourself. This example code just prints the numbers.

def read_mpiaij(file):
    lines = file.read().splitlines()
    assert 'Mat Object: ' in lines[0]
    assert lines[1] == '  type: mpiaij'
    for line in lines[2:]:
        parts = line.split(': ')
        assert len(parts) == 2
        assert parts[0].startswith('row ')

        row_index = int(parts[0][4:])
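        # The trailing comma keeps a row with a single pair from being
        # evaluated as a flat (column_index, value) tuple.
        row_contents = eval(parts[1].replace(')  (', '), (') + ',')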

        # Here you have the row_index and a tuple of (column_index, value)
        # pairs that specify the non-zero contents. You could process this
        # depending on your needs, e.g. store the values in an array.
        for (col_index, value) in row_contents:
            print('row %d, col %d: %s' % (row_index, col_index, value))
            # TODO: Implement real code here.
            # You probably want to do something like:
            # data[row_index][col_index] = value


def main():
    with open('input.txt', 'rt', encoding='ascii') as file:
        read_mpiaij(file)


if __name__ == '__main__':
    main()

Output:

row 0, col 0: 0.934865
row 0, col 1: 0.00582401
row 0, col 2: -0.00125881
row 0, col 3: 0.000157352
row 0, col 10: 0.0212704
row 0, col 11: -9.37151e-05
row 0, col 12: 7.77296e-06
row 0, col 13: 1.15276e-06
row 0, col 20: -0.00457321
row 0, col 21: 9.31045e-06
row 0, col 22: -1.37541e-07
row 0, col 23: -3.00994e-07
row 0, col 30: 0.000571716
row 0, col 31: 5.82622e-07
row 0, col 32: -2.27908e-07
row 0, col 33: 4.55904e-08
row 0, col 3410: 0.0005718
row 0, col 3411: 3.14914e-06
row 0, col 3412: -5.83246e-07
row 0, col 3413: 5.58045e-08
row 0, col 3420: -0.00457491
row 0, col 3421: -3.91645e-05
row 0, col 3422: 6.62677e-06
row 0, col 3423: -5.10165e-07
row 0, col 3430: 0.0212818
row 0, col 3431: 0.000230778
row 0, col 3432: -3.75686e-05
row 0, col 3433: 2.57173e-06
...
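
If the input file is not fully trusted, `ast.literal_eval` from the standard library can take the place of `eval()`, since it only accepts Python literals. The following is a minimal sketch under the same assumptions about the file layout as above; the function names `parse_row` and `read_triplets` are made up for this example. Instead of printing, it collects the entries into three parallel lists (row, column, value).

```python
import ast
import io


def parse_row(line):
    """Parse one 'row N: (c, v)  (c, v) ...' line into (N, [(c, v), ...])."""
    head, _, body = line.partition(':')
    row_index = int(head.split()[1])
    # Normalise whitespace, then separate the pairs with commas. The trailing
    # comma keeps a single pair from collapsing into a flat tuple.
    text = ' '.join(body.split()).replace(') (', '), (') + ','
    return row_index, list(ast.literal_eval(text))


def read_triplets(file):
    """Collect parallel lists of row indices, column indices and values."""
    rows, cols, vals = [], [], []
    for line in file:
        if not line.startswith('row '):
            continue  # header lines such as 'Mat Object:' and 'type:'
        row_index, pairs = parse_row(line)
        for col_index, value in pairs:
            rows.append(row_index)
            cols.append(col_index)
            vals.append(value)
    return rows, cols, vals


# Small made-up sample in the same layout as the question's file.
sample = io.StringIO(
    "Mat Object: 1 MPI processes\n"
    "  type: mpiaij\n"
    "row 0: (0, 0.934865)  (1, 0.00582401) \n"
    "row 1: (1, -0.00125881) \n"
)
print(read_triplets(sample))
```

If SciPy is available, the three lists can be passed on as `scipy.sparse.coo_matrix((vals, (rows, cols)))` to obtain a sparse matrix for further analysis.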

TextGeek · Answer 2 · 2019-08-10T14:09:00.133

Regexes are your friend. How about something like:

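import re

# Matches "(col, value)"; float() below validates the value, so negative
# numbers and exponent notation such as -9.37151e-05 are accepted.
pair_re = re.compile(r'\((\d+),\s*([^\s)]+)\)')

with open('input.txt') as fh:  # placeholder file name
    for recnum, rec in enumerate(fh):
        head = rec.strip()
        if not head or head.startswith(('Mat Object', 'type:')):
            continue  # skip blank lines and the two header lines
        mat = re.match(r'row\s*(\d+):\s*(.*)', rec)
        if not mat:
            raise IOError("Bad data at rec %d." % recnum)
        rowNum = int(mat.group(1))
        rest = mat.group(2)
        lastColNum = -1
        for col in pair_re.finditer(rest):
            colNum = int(col.group(1))
            if colNum <= lastColNum:
                raise KeyError("colNum %d out of order at rec %d." % (colNum, recnum))
            lastColNum = colNum
            value = float(col.group(2))
            # save cell, e.g. in a NumPy array: tbl[rowNum, colNum] = value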

I assumed that the column items in each row are in ascending order. If not, or if there are other constraints (for example, if values must lie in -1.0...1.0, which seems true in your example), you can of course adjust. It's worth checking the data, because data is rarely as clean as one hopes.
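
Such checks could be gathered in a small helper. This is only a sketch: the name `check_row` and the default bounds are assumptions chosen for illustration, and the right constraints depend on what the matrix represents.

```python
def check_row(pairs, lo=-1.0, hi=1.0):
    """Return a list of problems found in one row's (column, value) pairs."""
    problems = []
    last_col = -1
    for col, value in pairs:
        # Columns are expected to be strictly ascending within a row.
        if col <= last_col:
            problems.append("column %d out of order" % col)
        # Values are expected to stay inside the assumed range [lo, hi].
        if not lo <= value <= hi:
            problems.append("value %g in column %d out of range" % (value, col))
        last_col = col
    return problems


print(check_row([(0, 0.934865), (1, 0.00582401), (2, -0.00125881)]))
print(check_row([(2, 0.5), (1, 2.0)]))
```

Returning a list of messages instead of raising on the first problem makes it easier to survey a large file before deciding how strict the parser should be.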
