
I'm a chemist dealing with Potential Energy Distributions. The output is kind of messy (some lines use more columns than others) and we have several analyses in one file, so I'd like to start and stop parsing when I encounter specific keywords or markers like "***".

Here is an example of my input:

Average max. Potential Energy <EPm> = 41.291
TED Above 100 Factor TAF=0.011
Average coordinate population 1.000
s 1     1.00   STRE    4    7   NH    1.015024  f3554 100
s 2     1.00   STRE    2    1   CH    1.096447  f3127 13  f3126 13  f3073 37  f3073 34
s 3     1.00   STRE    2    5   CH    1.094347  f3127 38  f3126 36  f3073 12  f3073 11
s 4     1.00   STRE    6    8   CH    1.094349  f3127 36  f3126 38  f3073 11  f3073 13
s 5     1.00   STRE    2    3   CH    1.106689  f2950 48  f2944 46
s 6     1.00   STRE    6    9   CH    1.106696  f2950 47  f2944 47
s 7     1.00   STRE    6   10   CH    1.096447  f3127 12  f3126 13  f3073 33  f3073 38
s 8     1.00   STRE    4    2   NC    1.450644  f1199 43  f965 39
s 9     1.00   STRE    4    6   NC    1.450631  f1199 43  f965 39
s 10    1.00   BEND    7    4    6   HNC   109.30  f1525 12  f1480 42  f781 18
s 11    1.00   BEND    1    2    3   HCH   107.21  f1528 33  f1525 21  f1447 12
s 12    1.00   BEND    5    2    1   HCH   107.42  f1493 17  f1478 36  f1447 20
s 13    1.00   BEND    8    6   10   HCH   107.42  f1493 17  f1478 36  f1447 20
s 14    1.00   BEND    3    2    5   HCH   108.14  f1525 10  f1506 30  f1480 14  f1447 13
s 15    1.00   BEND    9    6    8   HCH   108.13  f1525 10  f1506 30  f1480 14  f1447 13
s 16    1.00   BEND   10    6    9   HCH   107.20  f1528 33  f1525 21  f1447 12
s 17    1.00   BEND    6    4    2   CNC   112.81  f383 85
s 18    1.00   TORS    7    4    2    1   HNCH  -172.65  f1480 10  f781 55
s 19    1.00   TORS    1    2    4    6   HCNC    65.52  f1192 27  f1107 14  f243 18
s 20    1.00   TORS    5    2    4    6   HCNC  -176.80  f1107 17  f269 35  f243 11
s 21    1.00   TORS    8    6    4    2   HCNC  -183.20  f1107 17  f269 35  f243 11
s 22    1.00   TORS    3    2    4    6   HCNC   -54.88  f1273 26  f1037 22  f243 19
s 23    1.00   TORS    9    6    4    2   HCNC    54.88  f1273 26  f1037 22  f243 19
s 24    1.00   TORS   10    6    4    2   HCNC   -65.52  f1192 30  f1107 18  f243 21
****
 9 STRE modes:
  1  2  3  4  5  6  7  8  9
 8 BEND modes:
 10 11 12 13 14 15 16 17
 7 TORS modes:
 18 19 20 21 22 23 24
 19 CH modes:
  2  3  4  5  6  7 11 12 13 14 15 16 18 19 20 21 22 23 24
 0 USER modes:


alternative coordinates 25 
k 10    1.00   BEND    7    4    2   HNC   109.30
k 11    1.00   BEND    1    2    4   HCN   109.41
k 12    1.00   BEND    5    2    4   HCN   109.82
k 13    1.00   BEND    8    6    4   HCN   109.82
k 14    1.00   BEND    3    2    1   HCH   107.21
k 15    1.00   BEND    9    6    4   HCN   114.58
k 16    1.00   BEND   10    6    8   HCH   107.42
k 18    1.00   TORS    7    4    2    5   HNCH   -54.98
k 18    1.00   TORS    7    4    2    3   HNCH    66.94
k 18    1.00   OUT     4    2    6    7   NCCH    23.30
k 19    1.00   OUT     2    3    5    1   CHHH    21.35
k 19    1.00   OUT     2    1    5    3   CHHH    21.14
k 19    1.00   OUT     2    3    1    5   CHHH    21.39
k 20    1.00   OUT     2    1    4    5   CHNH    21.93
k 20    1.00   OUT     2    5    4    1   CHNH    21.88
k 20    1.00   OUT     2    1    5    4   CHHN    16.36
k 21    1.00   TORS    8    6    4    7   HCNH    54.98
k 21    1.00   OUT     6   10    9    8   CHHH    21.39
k 22    1.00   OUT     2    1    4    3   CHNH    20.12
k 22    1.00   OUT     2    5    4    3   CHNH    19.59
k 23    1.00   TORS    9    6    4    7   HCNH   -66.94
k 23    1.00   OUT     6    8    4    9   CHNH    19.59
k 24    1.00   TORS   10    6    4    7   HCNH  -187.34
k 24    1.00   OUT     6    9    4   10   CHNH    20.32
k 24    1.00   OUT     6    8    4   10   CHNH    21.88

I'd like to skip the first 3 lines (I know how to do that with skiprows=3), then stop parsing at the "***" and fit my content into 11 columns with predefined names like "tVib1", "%PED1", "tVib2", "%PED2", etc.

After that, in this same file, I'll have to start parsing after the words "alternative coordinates", again into 11 columns.

This looks very hard for me to achieve.

Any help is much appreciated.

HCSthe2nd
  • I just noticed that the columns are really messy. It seems that certain columns are randomly omitted in both segments. How do you know, which value belongs into which of your 11 columns? – Mr. T Jan 26 '18 at 15:00
  • @Piinthesky, you're right, it is a mess. and looks challenging to create a good Python script to organize it. I know that we have columns missing by comparing each line with the whole set of 11 columns. – HCSthe2nd Jan 26 '18 at 16:22
  • @Piinthesky, thank you for taking a look at this. Here is the link https://1drv.ms/f/s!AscFK8cOesFhiOI2tk2k_n_Ysu4SGA – HCSthe2nd Jan 28 '18 at 13:57

3 Answers


It seems not that hard: you have already described everything you want; you just need to translate it into Python. First, read the whole file and store it in a list of lines:

with open(filename,'r') as file_in:
    lines = file_in.readlines()

then you can begin reading at line 3 and parse until you find the "***":

ind = 3
while lines[ind].find('***') == -1:
    tmp = lines[ind]
    # ... do what you want with tmp ...
    ind = ind + 1

and then you can keep going the same way, replacing find("...") with whatever keyword you need.

To process each line "tmp", you can use handy Python string methods like tmp.split() and tmp.strip(), convert strings to numbers, etc.
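Putting the pieces together, the skip/stop logic can be sketched end to end. The padding to 11 columns and the stop marker are assumptions taken from the question; adapt n_cols and the column names ("tVib1", "%PED1", ...) to your needs:

```python
# Minimal sketch: collect rows after the header until the "***" marker,
# padding short rows so every row has the same number of columns.
def parse_block(lines, skip=3, stop="***", n_cols=11):
    rows = []
    for line in lines[skip:]:
        if stop in line:                             # stop parsing at the marker
            break
        parts = line.split()
        parts += [None] * (n_cols - len(parts))      # pad short rows
        rows.append(parts[:n_cols])
    return rows

sample = [
    "header 1", "header 2", "header 3",
    "s 1     1.00   STRE    4    7   NH    1.015024  f3554 100",
    "****",
]
rows = parse_block(sample)
print(rows[0][3])  # STRE
```

The resulting list of equal-length rows can then be handed to pandas.DataFrame with your predefined column names.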

Benjamin Barrois

For the .dd2 file provided, I used another strategy. The implicit assumptions are:
1) a line is only converted when it starts either with lower case - space - digit or with at least five whitespaces, followed by at least one upper-case word
2) if missing, the first, third and every f column are reused from the previous line
3) the third column contains the first upper-case word
4) if the distance between the two upper-case words is less than a given variable max_col, NaN is introduced for the missing values
5) the f value columns start two columns after the second upper-case column

import re
import pandas as pd
import numpy as np

def align_columns(file_name, col_names = ["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6"], max_col = 4):
    #max_col: number of columns between the two capitalised columns
    #column names for the first values N = number, S = string, F = f number, adapt to your needs
    #both optional parameters 

    #collect all data sets as a list of lists
    all_lines = []
    last_id, last_cat, last_fval = 0, 0, []

    #opening file to read
    for line_in in open(file_name, "r"):
        #use only lines that start either
        #with lower case - space - digit or at least five spaces
        #and have an upper case word in the line
        start_str = re.match(r"([a-z]\s\d|\s{5,}).*[A-Z]+", line_in)
        if not start_str:
            continue

        #split data columns into chunks using 2 or more whitespaces as a delimiter
        sep_items = re.split(r"\s{2,}", line_in.strip())
        #if ID is missing use the information from last line
        if not re.match(r"[a-z]\s\d", sep_items[0]):
            sep_items.insert(0, last_id)
            sep_items.insert(2, last_cat)
            sep_items.extend(last_fval)
        #otherwise keep the information in case it is missing from next line
        else:
            last_id = sep_items[0]
            last_cat = sep_items[2]

        #get index for the two columns with upper case words
        index_upper = [i for i, item in enumerate(sep_items) if item.isupper()]

        if len(index_upper) < 2 or index_upper[0] != 2 or index_upper[1] > index_upper[0] + max_col + 1:
            print("Irregular format, skipped line:")
            print(line_in)
            continue

        #get f values in case they are missing for next line
        last_fval = sep_items[index_upper[1] + 2:]

        #if not enough rows between the two capitalised columns, fill with NaN
        if index_upper[1] < 3 + max_col:
            fill_nan = [np.nan] * (3 + max_col - index_upper[1])
            sep_items[index_upper[1]:index_upper[1]] = fill_nan
        #append to list
        all_lines.append(sep_items)

    #create pandas dataframe from list
    df = pd.DataFrame(all_lines)
    #convert columns to float, if possible
    df = df.apply(pd.to_numeric, errors='ignore', downcast='float')
    #label columns according to col_names list and add f0, f1... at the end
    df.columns = [col_names[i] if i < len(col_names) else "f" + str(i - len(col_names)) for i in df.columns] 
    return df

#-----------------main script--------------
#use standard parameters of function
conv_file = align_columns("a1-91a.dd2")
print(conv_file)

#use custom parameters for labels and number of fill columns 
col_labels = ["X1", "Y1", "Z1", "A1", "A2", "A3", "A4", "A5", "A6", "Z2", "B1"]
conv_file2 = align_columns("a1-91a.dd2", col_labels, 6)
print(conv_file2)

This is more flexible than my first solution: the number of f value columns is not restricted to a specific number.
The example shows how to use the function with its default parameters and with custom parameters. It is surely not the most beautiful solution, and I am happy to upvote any more elegant one. But it works, at least in my Python 3.5 environment. If there are any problems with a data file, please let me know.

P.S.: The solution to convert the appropriate columns into float was provided by jezrael

Mr. T
  • what an elegant solution! It is amazing. Thank you for using your time on this. – HCSthe2nd Jan 30 '18 at 23:27
  • You're welcome. Please look carefully over your data after importing. There is a good chance, that I might have not considered an exception in the data structure. But then again, scientists scrutinise their data at every step of the data processing. A last question: The files - are they generated by commercial scientific software? – Mr. T Jan 30 '18 at 23:43
  • It is generated by a free software called VEDA: http://smmg.pl/software/veda And after I started taking a look at it I saw that the generated data is a mess, but my colleagues at the research team use it a lot, so it is important to make it less painful. Again, thank you. – HCSthe2nd Jan 31 '18 at 11:20

I made a first script according to your example here on SO. It is not very flexible: it assumes that the first three columns are always filled and then aligns the two columns with upper-case words by filling the up-to-four columns in between with NaN where necessary. The reason for filling with this value is that pandas functions like .sum() or .mean() ignore it when calculating a column's result.

import re
import io 
import pandas as pd

#adapt this part to your needs    
#enforce to read 12 columns, N = number, S = string, F = f number
col_names = ["ID", "N1", "S1", "N2", "N3", "N4", "N5", "S2", "N6", "F1", "F2", "F3"]
#only import lines that start with these patterns
startline = ("s ", "k ")    
#number of columns between the two capitalised columns
max_col = 4                

#create temporary file like object to feed later to the csv reader
pan_wr = io.StringIO()

#opening file to read
for line in open("test.txt", "r"):
    #checking, if row should be ignored
    if line.startswith(startline):
        #find the text between the two capitalized columns
        col_betw = re.search(r"\s{2,}([A-Z]+.*)\s{2,}[A-Z]+\s{2,}", line).group(1)
        #determine, how many elements we have in this segment
        nr_col_betw = len(re.split(r"\s{2,}", col_betw.strip()))
        #test, if there are not enough numbers  
        if nr_col_betw <= max_col:
            #fill with NA, which is interpreted by pandas csv reader as NaN
            subst = col_betw + "   NA" * (max_col - nr_col_betw + 1) 
            line = line.replace(col_betw, subst, 1)
        #write into file like object the new line
        pan_wr.writelines(line)

#reset pointer for csv reader 
pan_wr.seek(0)

#csv reader creates data frame from file like object, splits at delimiter with more than one whitespace
#index_col: the first column is not treated as an index, names: name for columns
df = pd.read_csv(pan_wr, delimiter = r"\s{2,}", index_col = False, names = col_names, engine = "python")
print(df)

This works well, but it can't deal with the .dd2 file you posted later. I am currently testing a different approach for that. To be continued...

P.S.: I found conflicting information on the use of index_col = False with the csv reader. Some say you should now use index_col = None to keep the first column from being converted into the index, but that didn't work in my tests.
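The index_col = False behavior can be checked on a tiny in-memory frame (a sketch; the column names here are made up):

```python
import io
import pandas as pd

data = "a  1  2\nb  3  4\n"
# index_col=False tells pandas the first column is data, not the row index
df = pd.read_csv(io.StringIO(data), delimiter=r"\s{2,}", index_col=False,
                 names=["ID", "x", "y"], engine="python")
print(list(df["ID"]))  # ['a', 'b']
```

With the default (index_col = None) and explicit names, this particular input parses the same way; index_col = False mainly matters when the file has one more data column than header names, which is why it is forced here.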

Mr. T