The Problem:

I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:

                                                 Index
                         1         2         3         4         5         6
 eigenvalues:        -15.439    -1.127    -0.616    -0.616    -0.397     0.272
   1  H 1   s        0.00077  -0.03644   0.03644   0.08129  -0.00540   0.00971
   2  H 1   s        0.00894  -0.06056   0.06056   0.06085   0.04012   0.03791
   3  N     s        0.98804  -0.11806   0.11806  -0.11806   0.15166   0.03098
   4  N     s        0.09555   0.16636  -0.16636   0.16636  -0.30582  -0.67869
   5  N     px       0.00318  -0.21790  -0.50442   0.02287   0.27385   0.37400
                         7         8         9        10        11        12
 eigenvalues:          0.373     0.373     1.168     1.168     1.321     1.415
   1  H 1   s       -0.77268   0.00312  -0.00312  -0.06776   0.06776   0.69619
   2  H 1   s       -0.52651  -0.03358   0.03358   0.02777  -0.02777   0.78110
   3  N     s       -0.06684   0.06684  -0.06684  -0.01918   0.01918   0.01918
   4  N     s        0.23960  -0.23960   0.23961  -0.87672   0.87672   0.87672
   5  N     px       0.01104  -0.52127  -0.24407  -0.67837  -0.35571  -0.01102
                        13        14        15
 eigenvalues:          1.592     1.592     2.588
   1  H 1   s        0.01433   0.01433  -0.94568
   2  H 1   s       -0.18881  -0.18881   1.84419
   3  N     s        0.00813   0.00813   0.00813
   4  N     s        0.23298   0.23298   0.23299
   5  N     px      -0.08906   0.12679  -0.01711

The problem is that I need to extract only the coefficients and reformat the table so that the coefficients can be read in rows rather than columns. The resulting array would have the form:

[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
 [-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
 [0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
 [0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
 [-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
 [0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
 [-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
 [0.00312, -0.03358, 0.06684, -0.23960, -0.52127]
 ...
 [0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
 [-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]

This would be manageable for me if it weren't for the fact that the number of columns changes from file to file.


What I have tried:

I had earlier managed to get the eigenvalues by:

eigenvalues = []
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            if 'eigenvalues' in line:
                eigenvalues.append(line.split()[1:])

flatten = [item for sublist in eigenvalues for item in sublist]
$ ['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']

I then attempted several variants of this. My most recent approach was:

dir = {}
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            for i in range(1, number_of_coefficients+1):
                if str(i) in line.split()[0]:
                    if line.split()[1].isdigit() == False:
                        if line.split()[3] in ['s', 'px', 'py', 'pz']:
                            dir[str(i)].append(line.split()[4:])
                        else:
                            dir[str(i)].append(line.split()[3:])

This seemed to get me close; however, I got a strange duplication of numbers in seemingly random order. The idea was that I would then be able to convert the dictionary into the array.

Please HELP!!


EDIT: The letters in the 3rd and sometimes 4th column are also variable (they can be s, px, py, or pz).

1 Answer

Here's one way to do it. This approach has a few noteworthy aspects.

First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.

Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.

Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
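To see the second and third points in isolation, here is a minimal sketch on two hard-coded rows. The rows are copied from the question's first section and truncated to three columns; the offset 20 is the one that fits this sample and may need adjusting for other files.

```python
# Two coefficient rows from the question's table, truncated to three columns.
lines = [
    "   1  H 1   s        0.00077  -0.03644   0.03644",
    "   5  N     px       0.00318  -0.21790  -0.50442",
]

# Fixed-width slice: in this sample the labels end before column 20,
# so everything from column 20 onward holds only numbers.
rows = [[float(c) for c in line[20:].split()] for line in lines]

# Transpose: zip(*rows) groups the i-th value of every row together,
# turning the table's columns into rows.
transposed = [list(col) for col in zip(*rows)]

print(rows)
print(transposed)
```

Each inner list of `transposed` holds one table column, which is the orientation the question asks for.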

import sys

def get_section(fh):
    # Takes an open file handle.
    # Yields each section of lines having coefficients.
    lines = []
    start = False
    for line in fh:
        if 'eigenvalues' in line:
            start = True
            if lines:
                yield lines
                lines = []
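        elif start:
            lines.append(line)
            # Assumes a 'px' row is the last coefficient row of each section,
            # as in the sample data; adjust this marker for other files.
            if 'px' in line:
                start = False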
    if lines:
        yield lines

def main():
    coeffs = []
    with open(sys.argv[1]) as fh:
        for sect in get_section(fh):
            # Grab the rows from a section.
            rows = [
                [float(c) for c in line[20:].split()]
                for line in sect
            ]
            # Transpose them. See https://stackoverflow.com/questions/6473679
            transposed = list(map(list, zip(*rows)))
            # Add to the list-of-lists of coefficients.
            coeffs.extend(transposed)

    # Check.
    for cs in coeffs:
        print(cs)

if __name__ == '__main__':
    main()

Output:

[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]
FMc
  • Improve with `enumerate(coeffs, 1)` – stovfl Jun 19 '20 at 10:55
  • This is superb. I should have mentioned, the letters in the 3rd (sometimes 4th) column are variable and can equal `s, px, py, pz` and other values. Is there an easy fix for your solution, so it always goes to the final line (not always of length 15)? – theotheraccount Jun 19 '20 at 12:37
  • @FMC I thought about finding the spacing between the sections and then using a while loop? Thanks again for your help - super appreciated. – theotheraccount Jun 19 '20 at 12:46
  • @theotheraccount You can see all of the data and are in a better position to evaluate the most practical approach. If the `s`, `px`, etc. are not reliable parsing markers, another approach would be to define each section solely based on the `eigenvalues` lines. Then write a different function to take a raw section's worth of lines and filter it down to just the lines having coefficients -- e.g., maybe strip the lines and retain only those starting with a digit? That's the general idea here: divide the text into meaningful sections and then keep filtering down one simple step at a time. – FMc Jun 19 '20 at 12:47