How to retrieve data from an irregularly formatted text database in Python?

Question

I'm working on code to calculate various thermodynamic properties of a given set of molecules. To do so, I have to plug in 9 coefficients into a set of equations to get the desired values. These coefficients, which vary from molecule to molecule, are retrieved from the NASA Thermobuild database, which has the following format:

C2Cl4 Tetrachloroethylene HF298=-5.034 kcal Burcat G3B3
3 T05/08 C 2.00CL 4.00 0.00 0.00 0.00 0 165.8322000 -21064.348 50.000 200.000 7 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 0.0 19563.551 -5.821898980D+03 4.158580080D+02-7.790140830D+00 1.615966138D-01 -6.791370520D-04 1.598431875D-06-1.556882412D-09 0.000000000D+00-6.205198010D+03 5.774956220D+01 200.000 1000.000 7 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 0.0 19563.551

4.940446670D+04 -1.030763621D+03 1.098508036D+01 1.645945662D-02-2.178412229D-05 1.410593520D-08-3.663931630D-12 0.000000000D+00 -3.353235260D+02-2.878634227D+01 1000.000 6000.000 7 -2.0 -1.0 0.0 1.0 2.0 3.0 4.0 0.0 19563.551 -3.067008915D+05-1.128336557D+03 1.681089243D+01-3.159107946D-04 6.850908950D-08 -7.749796920D-12 3.556100470D-16 0.000000000D+00-1.944193938D+03-5.966771040D+01

The specific numbers I need for the calculations are in bold.

(alternatively, in codeblock form so it's a bit neater and closer to the actual arrangement in the database .txt file)

C2Cl4 Tetrachloroethylene  HF298=-5.034 kcal Burcat G3B3                         
3 T05/08 C   2.00CL  4.00    0.00    0.00    0.00 0  165.8322000     -21064.348
 50.000   200.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-5.821898980D+03 4.158580080D+02-7.790140830D+00 1.615966138D-01-6.791370520D-04
 1.598431875D-06-1.556882412D-09 0.000000000D+00-6.205198010D+03 5.774956220D+01
 200.000  1000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
 4.940446670D+04-1.030763621D+03 1.098508036D+01 1.645945662D-02-2.178412229D-05
 1.410593520D-08-3.663931630D-12 0.000000000D+00-3.353235260D+02-2.878634227D+01
 1000.000  6000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-3.067008915D+05-1.128336557D+03 1.681089243D+01-3.159107946D-04 6.850908950D-08
-7.749796920D-12 3.556100470D-16 0.000000000D+00-1.944193938D+03-5.966771040D+01

The database has hundreds of molecules in it, but I only need the coefficients for about 50 or so. I need a function that will go through the file, find the molecular species I need from a pre-written list, then pick out each coefficient and return them so I can use them in my calculations. It should also convert the "D+0%N" to "E+0%N" (I'm not sure why this database uses D's instead of E's to represent scientific notation).

I'm not at all familiar with SQL, so I've just been focusing on basic Python search functions. What I have so far is this:

import pandas as pd
import csv
import math
import numpy as np
species_list=[]
species=pd.read_table('Species list.txt') #list of molecular species I need coefficients for
species_temp=species['Species']
for i in range(len(species_temp)):
    species_list.append(species_temp[i])
with open('NEWNASA.TXT','rt') as database: #loads massive coefficient database
    for species_name in species_list:
        species_name=species_name+" " #to avoid returning ionic forms
        for line in database:
            if species_name in line:
                print(line) #test to see if it's working

However, a) this stops working after finding the first molecular species, and b) I'm still not sure how to tell the code to find the specific coefficients I need for the calculations. I'm figuring it'll involve regular expressions (which I don't have much experience with, either) and indexing, but that's as far as I've gotten. Any pointers or suggestions would be much appreciated!

Thanks!

Because those are the specific coefficients I need for the temperature range I'm doing the calculations at. I'm just not sure how to get Python to return those particular coefficients. — Tessa, May 09 '19 at 17:55
I guess I was really asking how to identify those specific fields from each record. Is there a document describing the record's structure? — wwii, May 10 '19 at 14:51
There is! https://www.grc.nasa.gov/www/CEAWeb/def_formats.htm Basically, the ones I'm after are in the mid-temperature range. — Tessa, May 11 '19 at 16:16
In your example record, the first line contains the species formula and name. In the description of the record format you linked to, the table suggests the first line contains either the name **or** the formula. Does that line actually contain both? — wwii, May 16 '19 at 14:19
Also, the table of the record format says that the name or formula is found in the first 17 characters of the first line, **but** the name in your example ends at the 26th character. Is there a newer record format definition? — wwii, May 16 '19 at 14:26

score 1 · Answer 1 · answered May 09 '19 at 04:16

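An opened file (database) is a one-time iterator. You cannot traverse it multiple times, so after the search for the first species has read it to the end, the inner loop has nothing left to read. The solution is to swap the for loops, or to load all of the file's lines into a list if the file is not too huge.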

for line in database:
    for species_name in species_list:
        species_name = species_name + " "
        if species_name in line:
            print(line)
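To see why the original nested loops stop after the first species, here is a small illustration using an in-memory stand-in for the file. The three-line sample text and the name `fake_db` are made up for the demonstration. The second pass yields nothing because the iterator is already exhausted; keeping the lines in a list avoids that.

```python
import io

# Hypothetical stand-in for the opened database file
fake_db = io.StringIO("C2Cl4 record\nCH4 record\nH2O record\n")

first_pass = [line for line in fake_db]    # consumes the iterator
second_pass = [line for line in fake_db]   # nothing left to read

print(len(first_pass))    # 3
print(len(second_pass))   # 0

# A list can be traversed as many times as needed
# (reasonable as long as the file is not too huge)
lines = first_pass
hits = [line for name in ("CH4 ", "H2O ") for line in lines if name in line]
print(hits)               # ['CH4 record\n', 'H2O record\n']
```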

answered May 09 '19 at 04:16

FMc

Hmmm, that causes it to print out the *entire* database. – Tessa May 11 '19 at 17:55

score 1 · Answer 2 · answered May 16 '19 at 16:03

I'll address the question of extracting the data you want from a record in your text database.

Once you find a record you are interested in (if species_name in line:) you need to advance to the seventh and eighth lines of that record and extract the coefficients.

The record format indicates that each line is 80 characters long and each number you are interested in is 16 characters long. So split the seventh and eighth lines into five equal parts (see "Split a string to even sized chunks") and convert each part to a float.
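A quick check of why fixed-width slicing is used here rather than whitespace splitting: adjacent fields run together whenever a number is negative. The sketch below uses the fourth line of the example record; the variable names are only illustrative.

```python
# Fourth line of the example record (80 columns, five 16-character fields)
line = ("-5.821898980D+03 4.158580080D+02-7.790140830D+00"
        " 1.615966138D-01-6.791370520D-04")

# Whitespace splitting merges fields that have no space between them
by_whitespace = line.split()
print(len(by_whitespace))   # 3

# Fixed-width slicing recovers all five fields
by_width = [line[i:i + 16] for i in range(0, 80, 16)]
print(len(by_width))        # 5

# float() does not accept the Fortran-style 'D' exponent, so swap in 'E'
values = [float(field.replace('D', 'E')) for field in by_width]
print(values)
```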

Setup:

import io

r = '''C2Cl4 Tetrachloroethylene  HF298=-5.034 kcal Burcat G3B3                         
3 T05/08 C   2.00CL  4.00    0.00    0.00    0.00 0  165.8322000     -21064.348
 50.000   200.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-5.821898980D+03 4.158580080D+02-7.790140830D+00 1.615966138D-01-6.791370520D-04
 1.598431875D-06-1.556882412D-09 0.000000000D+00-6.205198010D+03 5.774956220D+01
 200.000  1000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
 4.940446670D+04-1.030763621D+03 1.098508036D+01 1.645945662D-02-2.178412229D-05
 1.410593520D-08-3.663931630D-12 0.000000000D+00-3.353235260D+02-2.878634227D+01
 1000.000  6000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-3.067008915D+05-1.128336557D+03 1.681089243D+01-3.159107946D-04 6.850908950D-08
-7.749796920D-12 3.556100470D-16 0.000000000D+00-1.944193938D+03-5.966771040D+01'''

db = io.StringIO(r)
species_name = 'Tetrachloroethylene'

Process:

def get_coefficients(line):
    '''Split line into 5 floats.

    line has five 16 character numbers.
    '''
    fields = [line[i:i+16] for i in range(0, 80, 16)]  # 80 cols/line
    # float() needs 'E' instead of the 'D' exponent marker used in the file
    return [float(field.replace('D', 'E')) for field in fields]

for line in db:
    if species_name in line:    # first line of the record
        # skip to the seventh line of the record
        for _ in range(6):
            line = next(db)
        coefficients_1 = get_coefficients(line)
        print(coefficients_1)
        # skip to the eighth line of the record
        line = next(db)
        coefficients_2 = get_coefficients(line)
        print(coefficients_2)

You need to address the issue brought up by @FMc. Currently your code iterates over names in a list, and for each name it iterates over the complete database file looking for that name. To continue looking for the next name you need to start at the beginning of the file again by resetting the file pointer with database.seek(0).
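A minimal sketch of that rewind approach, using an in-memory stand-in for the file (the sample text and the list `wanted` are made up for illustration):

```python
import io

# Hypothetical stand-in for the opened database file
database = io.StringIO("C2Cl4 Tetrachloroethylene\nCH4 Methane\nH2O Water\n")
wanted = ["H2O", "C2Cl4"]   # illustrative list of species to look up

found = []
for species_name in wanted:
    database.seek(0)            # rewind before every new search
    for line in database:
        if species_name + " " in line:
            found.append(line.strip())
            break               # stop scanning once this species is found

print(found)   # ['H2O Water', 'C2Cl4 Tetrachloroethylene']
```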

Rewinding the file for every name is going to be very inefficient. As @FMc indicated, you need to iterate over each line of the database once and see if it contains one of your species names. To enhance this, species_list should be a set, which makes membership tests fast.

species_list = {'Tetrachloroethylene', 'Bar', 'Foo'}

Unfortunately there seems to be a discrepancy between the database record format for line one and your example record:

In your example record, the first line contains the species formula and name. The database record format table suggests the first line contains either the name or the formula.
The database record format says that the name or formula is found in the first 17 characters of the first line, but the name in your example ends at the 26th character.

If line one of each record is some variant of your example and the record format definition, maybe you can try something like:

for line in db:
    stuff = line.split()
    # blank lines in db?
    if len(stuff) > 0 and stuff[0] in species_list:
        pass  # go to lines seven and eight and get coeffs
    elif len(stuff) > 1 and stuff[1] in species_list:
        pass  # go to lines seven and eight and get coeffs
    else:
        continue
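Putting the pieces together, one possible single-pass version is sketched below. It assumes that every record starts with a line whose first whitespace-separated token is the formula, and that the numbers wanted are always on the seventh and eighth lines of the record, as in the example. `parse_fields` and `read_coefficients` are hypothetical helper names, not part of any library.

```python
import io

def parse_fields(line):
    """Split an 80-column line into five floats (16 characters each)."""
    return [float(line[i:i + 16].replace('D', 'E')) for i in range(0, 80, 16)]

def read_coefficients(db, wanted):
    """Return {formula: [ten floats]} for each wanted formula found in db.

    db is any iterable of lines (an open file works); wanted is a set.
    Assumes the formula is the first token of a record's first line and
    that lines seven and eight of the record hold the numbers of interest.
    """
    results = {}
    db = iter(db)
    for line in db:
        tokens = line.split()
        if tokens and tokens[0] in wanted:
            for _ in range(6):          # advance to the seventh line
                line = next(db)
            seventh = parse_fields(line)
            eighth = parse_fields(next(db))
            results[tokens[0]] = seventh + eighth
    return results

# The example record from the question
record = '''C2Cl4 Tetrachloroethylene  HF298=-5.034 kcal Burcat G3B3
3 T05/08 C   2.00CL  4.00    0.00    0.00    0.00 0  165.8322000     -21064.348
 50.000   200.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-5.821898980D+03 4.158580080D+02-7.790140830D+00 1.615966138D-01-6.791370520D-04
 1.598431875D-06-1.556882412D-09 0.000000000D+00-6.205198010D+03 5.774956220D+01
 200.000  1000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
 4.940446670D+04-1.030763621D+03 1.098508036D+01 1.645945662D-02-2.178412229D-05
 1.410593520D-08-3.663931630D-12 0.000000000D+00-3.353235260D+02-2.878634227D+01
 1000.000  6000.000 7 -2.0 -1.0  0.0  1.0  2.0  3.0  4.0  0.0        19563.551
-3.067008915D+05-1.128336557D+03 1.681089243D+01-3.159107946D-04 6.850908950D-08
-7.749796920D-12 3.556100470D-16 0.000000000D+00-1.944193938D+03-5.966771040D+01
'''

coeffs = read_coefficients(io.StringIO(record), {'C2Cl4', 'CH4'})
print(len(coeffs['C2Cl4']))   # 10
print(coeffs['C2Cl4'])
```

With the real file, `io.StringIO(record)` would be replaced by the opened database file. Matching on the species name instead of the formula would need the `stuff[1]` check shown above.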
