4

I would like to write a pandas dataframe to a file using a FORTRAN format string. I haven't been able to find anything online except a discussion of how this functionality would be nice. Does anyone know if this is possible?

I suppose I don't need to use a fortran format string...I just need to get the output file in a specific format that fortran can easily read.

UPDATE: For example, I have a large data file that has a specified fortran format. I load the file into my python function, manipulate the data, and then would like to export the manipulated data into a file with the same format it had originally. An example of the file format would be something like:

FORMAT (1X,F12.6,2F9.6,F11.7,T61,2F9.6,F10.7,T142,I6,1X,A2,T236,A1)

The reason I need to export the data in a specific format is because the output file will be read directly into a well-established fortran code (meaning the fortran code cannot be altered).

francescalus
  • 30,576
  • 16
  • 61
  • 96
WillaB
  • 420
  • 5
  • 12
  • I don't know what a pandas format string looks like... – Ross Jun 18 '15 at 00:39
  • why dont you just use csv? – agentp Jun 18 '15 at 01:31
  • I think WillaB is asking for tabular output. Each variable should start in the same column for each row whereas with CSV it can vary depending on the values. But maybe this can be done with CSV and formats? I'm not sure. @WillaB -- sample data and desired output would make this more clear. – JohnE Jun 18 '15 at 01:36
  • fortran does not need such column structure. – agentp Jun 18 '15 at 01:42
  • No one said fortran needed column structure but it is a common way to do it in fortran. I think it's a reasonable way to approach the question, but not the only way, and you can take whatever approach you like. – JohnE Jun 18 '15 at 02:20
  • 1
    @JohnE, I am asking for tabular output. Thanks for the clarification. I've included an example format I'm looking for above. – WillaB Jun 18 '15 at 23:00

4 Answers4

2

Here's a nice tidy solution that uses the fortranformat package (pip install fotranformat, https://pypi.org/project/fortranformat/) and df.apply() that let's you use a standard fortran format string:

import fortranformat as ff
import pandas as pd 

df = pd.DataFrame({
        'sampleId': ['A','B','C','D'],        
        'var1' : [0.002,0.004,0.006,0.002],
        'var2' : [1.2,1.4,1.6,1.2],
        'Nobs': [32,12,9,30]
    })

format_string = '(a5, f8.3, f8.1, i5)'
header_line = ff.FortranRecordWriter(format_string)
Formatted_df = df.apply(lambda x : header_line.write(x.values),axis=1)

The Formatted_df object will be a Series with a string element for each row of the dataframe:

>>> print(Formatted_df)
0        A   0.002     1.2   32
1        B   0.004     1.4   12
2        C   0.006     1.6    9
3        D   0.002     1.2   30
dtype: object
>>> print(Formatted_df.loc[0])
    A   0.002     1.2   32
>>> print(type(Formatted_df.loc[0]))
<class 'str'>

To write it to file you can then just use to_csv:

Formatted_df.to_csv('formatted_df.csv',index=False,header=False)

Note that this won't include any column names, so you may wish to initialize the output file then append to it:

output_fi='formatted_df.csv'
col_names=df.columns.tolist()
with open(output_fi,'w') as outfi: 
    outfi.write('# '+' '.join(col_names)+"\n")
    outfi.write('# '+format_string+"\n")
    
Formatted_df.to_csv(output_fi,mode='a',index=False,header=False)

Also note that this assumes you know the ordering of your dataframe columns already.

ALSO note that you may run into memory issues if you're dealing with very large dataframes as Formatted_df will be a complete copy of df. If that's the case, you'll have to chunk it up!

chris
  • 1,267
  • 7
  • 20
1

Update:

I would now do this sort of thing in two steps:

  • Step 1 -- Convert from pandas dataframe to numpy array or rec-array. This is trivial via the values or to_numpy methods. It's a little trickier if you have strings but see here for one technique. If you have simple numeric data (and no strings), just stick to a regular numpy array and don't bother with a rec-array or structured array.

  • Step 2 -- use numpy's tofile to write out a Fortran-readable binary

Original Answer:

I guess the bigger question is how to output from pandas to fortran and I'm not sure of the best way, but I'll try to show some fairly simple solutions mainly with to_csv().

Doing this will always give you faster IO, and I actually find binary easier than text in this case, although you do lose the ability to view the data as text.

df = pd.DataFrame({ 'x':[1.03,2.9,3.7],'y':[1,22,5] })

      x   y
0  1.03   1
1  2.90  22  
2  3.70   5

Standard pandas output is actually exactly what you are asking for here, but I'm not sure how to get that into a file except with copy and paste. Maybe there is a way with ipython (though not that I can find).

And here's some default csv output, which is obviously not columnar:

df.to_csv('foo.csv',index=False)

%more foo.csv
x,y
1.03,1
2.9,22
3.7,5

But you may be able to get this into fortran with list directed input.

If you can live with the same format for all numbers, you could do something like this:

df.astype(float).to_csv('foo.raw',index=False,float_format='%10.5f')

%more foo.raw
x,y
   1.03000,   1.00000
   2.90000,  22.00000
   3.70000,   5.00000

A couple notes here: that's not bad but limited in forcing you to use the same format for all numbers, which is pretty wasteful for single digit integers, for example. Also, I tried this with some NaNs and that didn't work very well. And also the commas are not needed there but when I tried to change the separator to ' ', then it quoted everything, so I just left it out.

Finally, the most flexible way might be to convert to strings and format them. This gives you some flexibility to format each column individually. Here's a simple example using a right justified format (and width of 8 for 'x' and 4 for 'y'):

df.x = df.x.map('{:>8}'.format)
df.y = df.y.map('{:>4}'.format)
df.to_csv('foo.str',index=False)

%more foo.str
x,y
    1.03,   1
     2.9,  22
     3.7,   5

I still can't figure out how to get rid of those commas, but this way does handle NaNs successfully.

JohnE
  • 29,156
  • 8
  • 79
  • 109
  • Thanks for these attempts. Unfortunately, each column needs to be treated individually (I can't have the same format for all columns). I also need to be able to write out semi-large data files automatically, as they're going straight from my function into a well-established fortran code. I'm going to play around with your last suggestion, and see if I can get rid of the commas. – WillaB Jun 18 '15 at 23:06
  • @WillaB -- the commas are annoying, but shouldn't be a problem with the fortran formats, right? Except for wasting some space. Alternatively, I realize it's not ideal, but it's just a one-liner in a shell script or one command in a text-editor to remove the commas, that's a quick and dirty way. If you really want them out via pandas, you might just want to make a followup question here focusing on that one thing. There must be some way, but I tried a few options and none of them worked.... – JohnE Jun 19 '15 at 00:04
  • the commas could be problematic if they _create_ an additional column between two variable outputs. For example, if the first 3 variables were the day, month, year a comma could be problematic if they variables are not separated by a space in the fortran code's format: `21112015` vs `21,11,2015` – WillaB Jun 22 '15 at 20:15
  • @WillaB But you can tell fortran which of the two cases it is. `2i2,i4` vs `2(i2,x),i4`. The former syntax is certainly a little simpler, but skipping nuisance columns is certainly possible with basic fortran input. – JohnE Jun 22 '15 at 20:48
  • But then you're making the assumption that the fortran code format statement can be modified. I _may_ be able to modify the code for my purposes here, but it is not an ideal solution. It seems like pandas _should_ have a straightforward method for forcing a specific format. It's surprising to me that it doesn't. Thanks again for your help. – WillaB Jun 22 '15 at 21:34
  • @WillaB I basically agree. You may want to post a followup question on this specifically. I'm almost inclined to do it myself just out of curiosity but am probably too lazy as it's not a practical issue for me. To be clear, pandas does allow you to force a specific format here, it's just that apparently neither of us can figure out how to not have nuisance commas (or quotes) inserted at the same time. – JohnE Jun 22 '15 at 21:46
  • Fair enough. I haven't had an opportunity to work with the solution you provided, so I may do that before posting a followup question. – WillaB Jun 22 '15 at 22:03
1

I know it's not ideal, but the way I've resorted to doing this is by printing out each record line by line:

df = pd.DataFrame({'alt':[1435.2, 1234.7], 'lat':[0.145, 0.324], 'lon':[12.45, 12.23]})

with open('flight.trk', 'w') as f:
    f.write("! Alt Lat Lon\n")
    for ix, alt, lat, lon in df.itertuples():
        f.write("{:10.2f} {:9.4f} {:9.4f}\n".format(alt, lat, lon))

Note that I've used itertuples to iterate over the rows as suggested here (What is the most efficient way to loop through dataframes with pandas?) but that relies on knowing the order of your columns (which are alphabetical in this case).

I've used this for tables longer than 10,000 lines and it's actually reasonably quick in my experience though I've not done any rigorous timing experiments.

Community
  • 1
  • 1
Duncan WP
  • 830
  • 5
  • 19
0

Coming on this much later, here is a solution which worked in my situation (myDF having a first column of strings, then 2 columns of floats). Supposing you need to read something in Fortran with

FORMAT (A19,F11.6,F11.6)

Then (with import numpy as np):

np.savetxt(myfile, myDF.to_numpy(), fmt="%19s %10.6f %10.6f")

Or (to avoid the additional space, but less clear):

np.savetxt(myfile, myDF.to_numpy(), fmt="%19s%11.6f%11.6f") 

(Just in case it helps someone else ;-))

ztl
  • 2,512
  • 1
  • 26
  • 40