1

I am facing an issue using the read_fwf command from the Python library pandas, same as described in this unresolved question

I want to read an ascii file containing results of numeric computations. The file contains repetitive header blocks with a variable number of lines between each of them and with a roughly described format but rather arbitrary string content. I do not want to allow any loss of information and therefore need to be able to process everything as it was originally.

Look at this examplary line:

$SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS 1181

Using a

with pd.read_fwf(self.file_path, header=None,
                         chunksize=chunk_size,  # no of lines read
                         colspecs=[(0, 23), (23, 41), (41, 59), (59, 72)],
                         engine='c', dtype={0: str}) as df_gen:

context manager to read the file, white spaces between columns will be stripped despite the seamlessly neighbouring column lengths. So it does not act as a fixed-width reader.

Cutting out the middle part of the line above with

"".join(s.dropna()).split("=")[-1].strip(" ")

with s being the row of the DataFrame containing that line, I will get a string

48-ELEMENT CANTILEVER BEAM ONTHE X-AXIS instead of

48-ELEMENT CANTILEVER BEAM ON THE X-AXIS.

Does anybody know an elegant solution to this issue? I am working on an approach where I will initially not cut into columns and do it for the numeric blocks after i extracted the header lines but it might increase computation time significantly.

Is there maybe a keyword in newer pandas versions that will disable the stripping?

TheDronist
  • 68
  • 8

1 Answers1

1

Take bigger cuts:

s = "$SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS                         1181"

pd.read_fwf(StringIO(s), header=None, colspecs=[(0, 11),(11, 51),(51,80)]

Output:

            0                                         1     2
0  $SUBTITLE=  48-ELEMENT CANTILEVER BEAM ON THE X-AXIS  1181

Or you can just read in line by line:

pd.read_table(StringIO(s), header=None)

...
                                                                                  0
0  $SUBTITLE= 48-ELEMENT CANTILEVER BEAM ON THE X-AXIS                         1181

A little regex magic and you can consistently extract the inner line:

re.findall('.*= (.+?)\s\s', s)

...

['48-ELEMENT CANTILEVER BEAM ON THE X-AXIS']
BeRT2me
  • 12,699
  • 2
  • 13
  • 31
  • Thanks for your answer, I tried using bigger cuts, but the problem is that there are differently shaped headlines and in the numeric blocks, some lines start with blanks, so they will be stripped too and it becomes very costly to filter all these special cases when I split into columns later. I did not know about the read_table function. It even has the "c" engine option that helps when loading files with sizes of several 100GBs. I might make use of the regex too, but I am quite the novice when it comes to that. – TheDronist May 19 '22 at 10:05