1

I'm extremely new to scripting (this will be the first I've ever written on my own), and I'm struggling...

Essentially, a program I'm using outputs a multiple lists of atomic cartesian coordinates for a molecule, where each set of coordinates contains a slightly different geometry. I then want to run calculations on this molecule at each of these geometries in another program. To do this, I need to create input files containing these coordinates based on a template file, and am hoping to do this via a bash or zsh script.

The first program outputs all geometries in a single file, of the form:

  13
        -15.02035015
 C          3.0629012683       -0.1237662359       -0.0004161296
 C          1.5725410176       -0.4599705612       -0.0010537192
 H          3.6545324244       -1.0351015878       -0.0040975574
 H          3.3232895577        0.4531937573        0.8855087768
 H          3.3225341254        0.4598595336       -0.8822056347
 N          0.6972014643        0.7054585380        0.0017284824
 H          1.3274001069       -1.0545725774        0.8830977697
 H          1.3271390154       -1.0504225891       -0.8878762403
 H          0.8745667924        1.2800026498       -0.8166554074
 H          0.8753847581        1.2767560879        0.8221982135
 S         -2.4024384670       -0.0657095889       -0.0009217321
 H         -2.1207044390       -1.3609141502        0.0227283569
 H         -1.0945221361        0.2739471520        0.0001162389
  13
        -15.02029090
 C          3.0458878237       -0.1642767706       -0.0538270794
 C          1.5490175255       -0.4572401536        0.0316764611
 H          3.3628431459        0.4546044246        0.7845264240
 H          3.2796163460        0.3602842378       -0.9790015411
 H          3.6124852940       -1.0910645341       -0.0311065021
 N          0.7057821467        0.7323073404        0.0100678359
 H          1.3291157247       -0.9968212951        0.9565729700
 H          1.2449884019       -1.0864558318       -0.8086148493
 H          0.8643815373        1.2571851525       -0.8447936589
 H          0.9361625337        1.3407384060        0.7900308086
 S         -2.3784808925       -0.1009812166       -0.0319557326
 H         -2.4637876581       -0.0476175701        1.2900767837
 H         -1.0744168237        0.2509451631       -0.0171658709
etc...

essentially, one line where the number of atoms is written (always the same number within a file, but will depend on the molecule you are interested in [3 if I'm looking at water, H2O; 5 for methane, CH4; 4 for ammonia, NH3; etc.]), one comment line (in the case of this program, the energies are written there) and then the cartesian coordinates, followed directly by the next set of coordinates. In my test file, there are 49 sets of coordinates.

The template file will look something like this:

#Comment line, molecule number CONF_NUMBER
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...

* xyz 0 1
COORDINATES
*

So, ideally I would end up writing a script which would take the coordinates of each molecule from the coordinates file and generate an input file for every listed geometry based on the template, replacing the COORDINATES text in the template with the one of the geometries (and, if possible, to include a number in the first comment line, replacing CONF_NUMBER with a number matching the directory and file name):

~/c1/molecule-c1-name.inp:

#Comment line, molecule number 1
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...

* xyz 0 1
O 1.0 23.21 1.1
H 2.0 2.90 1.1
H 3.0 2.33 1.1
*

~/c2/molecule-c2-name.inp:

#Comment line, molecule number 2
#
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym
%method
functional mgga_xc_b97m_v
end
etc etc etc...

* xyz 0 1
O 2.0 23.21 1.1
H 3.0 2.70 1.43
H 2.0 2.33 1.1
*

etc...

So far, I've been able to break up the individual geometries into separate each geometry into individual coordinate files and even remove the two lines above each geometry (which are not needed). Unfortunately, I cannot find a way to copy the whole geometry into the template; I'm stuck in a position where only a single line from the coordinate file is copies over per input file. The code I've got so far is:

#!/bin/zsh

input_file=$1


#arguments
while getopts t:j:p:s: flag
do
    case "${flag}" in
        t) template_file=${OPTARG};;
        j) job_file=${OPTARG};;
        p) prefix=${OPTARG};;
        s) suffix=${OPTARG};;
    esac
done

#Determine number of atoms
n_atoms=$(sed -n 1p $1)

#Determine number of lines to separate
splitlines=$(echo $n_atoms | awk '{print ($0+2)}')


#determine number of conformers
n_conformers=$(grep -c "$n_atoms" $1)

#echo $splitlines

#Split the coordinates into individual .xyz files
split -dl $splitlines $1 coords

#Rename coordinate files
for file in coords*
do
    sed -i -e 1,2d $file
    mv "$file" "$file.xyz"
done
rm *-e

#Copy coordinates into template file
n=1
while read file
do
  sed -i "" "s/COORDINATES/"$file"/r" template.inp > ea.h2s-input${n}.inp
((n++))
done < coords00.xyz #first coordinate file produced

In this example, coords00.xyz is the first of the separated coordinate files the script generates, template.inp is the template file, and ea.h2s-input${n}.inp is the name of the resulting input file (which I will later make customisable with arguments, hopefully).

Bear in mind that, during testing, I've been trying to get the simple things working, so this script is only written to get the first geometry copied into the template file (hence why the files are named explicitly, rather than as variables - although I'm hoping to use the arguments at the beginning of the script to help name each resulting input file), but I can't even get that to work!

Unfortunately, all other forum posts I've found only talk about copying small bits of text (a name, a word, one line of text) into templates - never multiple lines, let alone the entire contents of a file.

I have tried everything I can think of to get this to work, and this script is as close as I have gotten, but I cannot figure out how to print all of the coordinate lines into the template. Any help would be greatly appreciated!

1 Answers1

1

If you are open to using awk:

awk -v tmpl="$(<src.tmpl)" 'BEGIN{cnt=1} \
NR==1 {n_atoms=$1} \
NF==1 {flag=1; close(out); out=sprintf("coords""%02d"".xyz", cnt)} \
{if(flag==0) {print $0 > out; coord_cnt++}} \
{if(coord_cnt==n_atoms){printf "*\n" > out; coord_cnt=0}}\
{if (NF==1 && $1!=n_atoms) { flag=0; \
printf "#Comment line, molecule number %s\n%s\n", cnt, tmpl > out; cnt+=1}}' src.txt

template file contents should look like: (newlines escaped)

# \
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym \
%method \
functional mgga_xc_b97m_v \
end \
etc etc etc... \
\
* xyz 0 1

src.txt contents:

13
        -15.02035015
 C          3.0629012683       -0.1237662359       -0.0004161296
 C          1.5725410176       -0.4599705612       -0.0010537192
 H          3.6545324244       -1.0351015878       -0.0040975574
 H          3.3232895577        0.4531937573        0.8855087768
 H          3.3225341254        0.4598595336       -0.8822056347
 N          0.6972014643        0.7054585380        0.0017284824
 H          1.3274001069       -1.0545725774        0.8830977697
 H          1.3271390154       -1.0504225891       -0.8878762403
 H          0.8745667924        1.2800026498       -0.8166554074
 H          0.8753847581        1.2767560879        0.8221982135
 S         -2.4024384670       -0.0657095889       -0.0009217321
 H         -2.1207044390       -1.3609141502        0.0227283569
 H         -1.0945221361        0.2739471520        0.0001162389
  13
        -15.02029090
 C          3.0458878237       -0.1642767706       -0.0538270794
 C          1.5490175255       -0.4572401536        0.0316764611
 H          3.3628431459        0.4546044246        0.7845264240
 H          3.2796163460        0.3602842378       -0.9790015411
 H          3.6124852940       -1.0910645341       -0.0311065021
 N          0.7057821467        0.7323073404        0.0100678359
 H          1.3291157247       -0.9968212951        0.9565729700
 H          1.2449884019       -1.0864558318       -0.8086148493
 H          0.8643815373        1.2571851525       -0.8447936589
 H          0.9361625337        1.3407384060        0.7900308086
 S         -2.3784808925       -0.1009812166       -0.0319557326
 H         -2.4637876581       -0.0476175701        1.2900767837
 H         -1.0744168237        0.2509451631       -0.0171658709

Output: (note output files numbering starts with '01')

$ ls coords0*
coords01.xyz coords02.xyz

$ cat coords0*
#Comment line, molecule number 1
# 
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym 
%method 
functional mgga_xc_b97m_v 
end 
etc etc etc... 

* xyz 0 1
 C          3.0629012683       -0.1237662359       -0.0004161296
 C          1.5725410176       -0.4599705612       -0.0010537192
 H          3.6545324244       -1.0351015878       -0.0040975574
 H          3.3232895577        0.4531937573        0.8855087768
 H          3.3225341254        0.4598595336       -0.8822056347
 N          0.6972014643        0.7054585380        0.0017284824
 H          1.3274001069       -1.0545725774        0.8830977697
 H          1.3271390154       -1.0504225891       -0.8878762403
 H          0.8745667924        1.2800026498       -0.8166554074
 H          0.8753847581        1.2767560879        0.8221982135
 S         -2.4024384670       -0.0657095889       -0.0009217321
 H         -2.1207044390       -1.3609141502        0.0227283569
 H         -1.0945221361        0.2739471520        0.0001162389
*
#Comment line, molecule number 2
# 
!B97M-D4 verytightscf verytightopt freq DefGrid3 NoRI Mass2016 UseSym 
%method 
functional mgga_xc_b97m_v 
end 
etc etc etc... 

* xyz 0 1
 C          3.0458878237       -0.1642767706       -0.0538270794
 C          1.5490175255       -0.4572401536        0.0316764611
 H          3.3628431459        0.4546044246        0.7845264240
 H          3.2796163460        0.3602842378       -0.9790015411
 H          3.6124852940       -1.0910645341       -0.0311065021
 N          0.7057821467        0.7323073404        0.0100678359
 H          1.3291157247       -0.9968212951        0.9565729700
 H          1.2449884019       -1.0864558318       -0.8086148493
 H          0.8643815373        1.2571851525       -0.8447936589
 H          0.9361625337        1.3407384060        0.7900308086
 S         -2.3784808925       -0.1009812166       -0.0319557326
 H         -2.4637876581       -0.0476175701        1.2900767837
 H         -1.0744168237        0.2509451631       -0.0171658709
*
j_b
  • 1,975
  • 3
  • 8
  • 14
  • This script works with the example I gave you, but it seems to have problems for the general case. In the coordinates file, the first line (3, in the case of water) is always the number of atoms in the system (`n_atoms` in my script); while it will remain constant within a `src.txt` file, different `src.txt` files will have different `n_atoms`, depending on the molecule examined. How can I generalise the script to accept molecules with more (or less) atoms? I tried replacing `{if(coord_cnt==3)` with `{if(coord_cnt=="$n_atoms")` from my own script, but that didn't work. – isolated matrix Oct 27 '22 at 06:28
  • Unfortunately, that still only works if the number of atoms is 3. Is there any way to get awk to read the number on the first line and assign that as the variable automatically? For instance, first file I want to apply this script to contains 13 atoms. Is there any way to make the script read in 13 and assign that as the variable? – isolated matrix Oct 28 '22 at 06:31
  • Ah, so it’s not possible for the script to detect the number of atoms automatically, it has to be adjusted each time I use it? – isolated matrix Oct 28 '22 at 14:04
  • Sure it is possible. See updated answer for alternative version of `awk` command. – j_b Oct 28 '22 at 14:17
  • It works perfectly for the example I gave you but, for some reason, not my file. I'll update the question above so that it has some of the values directly copied from one of the files. Maybe there's something I've not included without realising it? So sorry to keep troubling you about this - I really appreciate your help with this! – isolated matrix Oct 28 '22 at 15:29
  • Ok, based on your updates to the sample data, the `awk` as written will not function properly. I will adjust based on the updated sample data. Answer updated accordingly. – j_b Oct 28 '22 at 17:00