0

I want to parse a flat file that looks like this in python;

  Element ID     Element Type     Result       Jacobian Sign    

============== ================= ========= =====================
      1            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      2            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      3            Parabolic      Warning          1.000000     
                  Hexahedron                                    
      4            Parabolic      Warning          1.000000     

I tried using the mechanism used in this answer as follows;

import pandas as pd

def parse_file(file):
    col_spec = [(0, 15), (16, 33), (34, 43), (44, 65)]
    return pd.read_fwf(file, colspecs=col_spec)

But it reads one record for the top row and one row which is empty apart from the word 'Hexahedron' as the element type.

>>> data = parse_file("example.txt")
>>> data.head()
       Element ID      Element Type    Result         Jacobian Sign
0             NaN               NaN       NaN                   NaN
1  ==============  ================  ========  ====================
2               1         Parabolic   Warning              1.000000
3             NaN        Hexahedron       NaN                   NaN <= Extra record
4               2         Parabolic   Warning              1.000000

As you can see from lines, the first two rows are captured as 2 records (records 2 and 3). I want the parser to capture the first two rows as one record, so that the phrase 'Parabolic Hexahedron' is captured as the element type. How can I do this?

Steztric
  • 2,832
  • 2
  • 24
  • 43

1 Answers1

1

some post-processing should do the trick. Here is some code using the shift operator. Also notice that opening the file is not required, just pass the filename to pd.read_fwf.

import pandas as pd

col_spec = [(0, 15), (15, 32), (32, 42), (43, 65)]
df = pd.read_fwf("example.txt", colspecs=col_spec, comment="=")

# combine rows
df["combined"] = (df['Element Type'] + df['Element Type'].shift(-1)).where(df['Element ID'].notnull(), df['Element Type'] )
# remove extra rows
df = df[df['Element ID'].notnull()]

this should give a DataFrame that looks like this:

  Element ID Element Type   Result Jacobian Sign             combined
2          1    Parabolic  Warning      1.000000  ParabolicHexahedron
4          2    Parabolic  Warning      1.000000  ParabolicHexahedron
6          3    Parabolic  Warning      1.000000  ParabolicHexahedron
8          4    Parabolic  Warning      1.000000  ParabolicHexahedron
uuazed
  • 879
  • 10
  • 19
  • Awesome! Thanks. Be careful about using + operator to concatenate fields. The documentation states that it is deprecated: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#data-alignment-and-arithmetic – Steztric Nov 13 '17 at 13:58
  • Thanks for the hint about the file. I actually open the file first, read forward to a marker, then start parsing. This is just left over from me sanitising it for this article. – Steztric Nov 13 '17 at 14:00
  • 1
    you could avoid some of the extra work using `skip_blank_lines=True` and `comment="="` as arguments to `read_fwf` – avigil Nov 13 '17 at 14:12