Background

Just as a preemptive TL;DR: tabs in my input file that span one extra column (a tab pads to a fixed position, however short the preceding field is) end up as a single space in the output file.

I have a .xyz file from which I need to remove a specific set of lines, as well as make some text replacements. I have a separate .txt file that contains a list of integers corresponding to the line numbers to remove, and another file listing the lines that need replacing. The removal list is called atomremove.txt and looks as follows; the other file is structured similarly.

14
13
11
10
4

The xyz file from which I need to remove lines looks something like this.

24
Comment block
H   18.38385    15.26701    2.28399
C   19.32295    15.80772    2.28641
O   16.69023    17.37471    2.23138
B   17.99018    17.98940    2.24243
C   22.72612    1.13322     2.17619
C   14.47116    18.37823    2.18809
C   15.85803    18.42398    2.20614
C   20.51484    15.08859    2.30584
C   22.77653    3.65203     2.19000
H   20.41328    14.02079    2.31959
H   22.06640    8.65013     2.27145
C   19.33725    17.20040    2.26894
H   13.96336    17.42048    2.19342
H   21.69450    3.68090     2.22196
C   23.01832    9.16815     2.25575
C   23.48143    2.42830     2.16161
H   22.07113    11.03567    2.32659
C   13.75496    19.59644    2.16380
O   23.01248    6.08053     2.20226
C   12.41476    19.56937    2.14732
C   16.54400    19.61620    2.20021
C   23.50500    4.83405     2.17735
C   23.03249    10.56089    2.28599
O   17.87129    19.42333    2.22107

My Code

The line removal and the replacements both work, but the output is not as expected. Some of the tabs appear to be replaced with a single space, specifically on lines whose 'y' coordinate has a single digit before the decimal point (for example 1.13322). I am going to share the resulting output first, and then my code.

Here is the output

19
Comment Block
H   18.38385    15.26701    2.28399
C   19.32295    15.80772    2.28641
O   16.69023    17.37471    2.23138
H   22.72612    1.13322 2.17619
C   14.47116    18.37823    2.18809
C   15.85803    18.42398    2.20614
C   20.51484    15.08859    2.30584
C   22.77653    3.65203 2.19000
C   19.33725    17.20040    2.26894
C   23.01832    9.16815 2.25575
C   23.48143    2.42830 2.16161
H   22.07113    11.03567    2.32659
C   13.75496    19.59644    2.16380
O   23.01248    6.08053 2.20226
C   12.41476    19.56937    2.14732
C   16.54400    19.61620    2.20021
C   23.50500    4.83405 2.17735
H   23.03249    10.56089    2.28599
O   17.87129    19.42333    2.22107

Here is my code.

atomstorefile="./extract_internal/atomremove.txt"
atomchangefile="./extract_internal/atomchange.txt"

temp="temp.txt"
tempp="tempp.txt"
temppp="temppp.txt"
filestoreloc="./"$basefilename"_xyzoutputs/chops"

#get number of files in directory and set a loop for that # of files
numfiles=$( ls "./"$basefilename"_xyzoutputs/splits" | wc -l )
numfiles=$(( numfiles/2 ))
counter=1

while [ $counter -lt $(( numfiles + 1 )) ];
do
    #set a loop for each split half
    splithalf=1
    while [ $splithalf -lt 3 ];
    do
        #storing the xyz file in a temp file for edits (non destructive)
        cat ./"$basefilename"_xyzoutputs/splits/split"$splithalf"-geometry$counter.xyz > $temp

        # changing specified atoms
        while read line;
        do
            line=$(( line + 2 ))
            sed -i "${line}s/C/H/" $temp
        done < $atomchangefile

        # removing specified atoms
        while read line;
        do
            line=$(( line + 2 ))
            sed -i "${line}d" $temp
        done < $atomstorefile
    
        remainatoms=$( wc -l $temp | awk '{print $1}' )
        remainatoms=$(( remainatoms - 2 ))
        tail -n $remainatoms $temp > $tempp
        echo $remainatoms > "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
        echo Comment Block >> "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
        cat $tempp >> "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
    
        splithalf=$(( splithalf + 1 ))
    done
    

    counter=$(( counter + 1 ))
done

I am sure the solution is simple. Any insight into what is causing this issue would be very appreciated.

tripleee
Tanmann13
  • See also [Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?](https://stackoverflow.com/questions/65538947/counting-lines-or-enumerating-line-numbers-so-i-can-loop-over-them-why-is-this) – tripleee Jul 18 '21 at 09:15
  • @Tanmann13 - The shown code does not change TABs in the .xyz files, so either some other code does it, or the TABs aren't changed at all and you mistakenly think so, perhaps because viewing input and output files under different conditions. You can verify this with e. g. `cat -A …`. – Armali Jul 18 '21 at 17:58
  • Replace `echo $remainatoms >>...` with `echo "$remainatoms" >>` . Small demo: `printf -v a "%s \t%s" space andtab; echo $a; echo "$a" ` – Walter A Jul 18 '21 at 21:21
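The quoting effect Walter A describes, and the kind of inspection Armali suggests, can be tried on a single line. This is a minimal sketch, assuming a POSIX shell; the sample line and the variable names `line`, `unquoted`, and `quoted` are made up for illustration, and `sed -n l` is used here as a portable stand-in for `cat -A`.

```shell
# Sample line with tabs between the coordinate fields (hypothetical data).
line=$(printf 'C   22.72612\t1.13322\t2.17619')

# Unquoted expansion: the shell splits the value on spaces and tabs,
# and echo joins the resulting words with single spaces.
unquoted=$(echo $line)

# Quoted expansion: the value is passed to echo unchanged.
quoted=$(echo "$line")

# sed's l command prints tabs visibly as \t and marks line ends with $.
printf '%s\n' "$unquoted" | sed -n l
printf '%s\n' "$quoted" | sed -n l
```

In the question's script the only unquoted `echo` calls write the atom count and the comment line, neither of which contains tabs, so quoting alone may not account for the reported output.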

2 Answers

Not sure what you are doing, but your file can be fixed with the command column -t < filename.

Example :

❯ cat test
H   18.38385    15.26701    2.28399
C   19.32295    15.80772    2.28641
O   16.69023    17.37471    2.23138
H   22.72612    1.13322 2.17619
C   14.47116    18.37823    2.18809
C   15.85803    18.42398    2.20614
C   20.51484    15.08859    2.30584
C   22.77653    3.65203 2.19000
C   19.33725    17.20040    2.26894
C   23.01832    9.16815 2.25575
C   23.48143    2.42830 2.16161
H   22.07113    11.03567    2.32659
C   13.75496    19.59644    2.16380
O   23.01248    6.08053 2.20226
C   12.41476    19.56937    2.14732
C   16.54400    19.61620    2.20021
C   23.50500    4.83405 2.17735
H   23.03249    10.56089    2.28599
O   17.87129    19.42333    2.22107

~ 
❯ column -t  < test
H  18.38385  15.26701  2.28399
C  19.32295  15.80772  2.28641
O  16.69023  17.37471  2.23138
H  22.72612  1.13322   2.17619
C  14.47116  18.37823  2.18809
C  15.85803  18.42398  2.20614
C  20.51484  15.08859  2.30584
C  22.77653  3.65203   2.19000
C  19.33725  17.20040  2.26894
C  23.01832  9.16815   2.25575
C  23.48143  2.42830   2.16161
H  22.07113  11.03567  2.32659
C  13.75496  19.59644  2.16380
O  23.01248  6.08053   2.20226
C  12.41476  19.56937  2.14732
C  16.54400  19.61620  2.20021
C  23.50500  4.83405   2.17735
H  23.03249  10.56089  2.28599
O  17.87129  19.42333  2.22107

Digvijay S

You wreck your whitespace because your variables are not quoted. But a much better solution is to refactor this monumentally overcomplicated shell script into a simple sed or Awk script.

Assuming the line numbers all indicate line numbers in the original input file, try this.

tmp=$(mktemp -t atomtmpXXXXXXXXX) || exit
trap 'rm -f "$tmp"' ERR EXIT

( sed 's%$%s/C/H/%' extract_internal/atomchange.txt
  sed 's%$%d%' extract_internal/atomremove.txt ) >"$tmp"

ls -l "$tmp"; nl "$tmp" # debugging

for file in "$basefilename"_xyzoutputs/splits/*; do
    dst="${basefilename}_xyzoutputs/chops/${file#*/splits/}"
    sed -f "$tmp" "$file" >"$dst"
done

This combines the two input files into a new sed script (remarkably, by way of sed); the debugging line lets you inspect the result (probably remove it once you understand how this works).
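To see what the generated script looks like, here is a small self-contained sketch on made-up data. The directory `workdir`, the file names inside it, and the six-line input are hypothetical stand-ins for the real atomchange.txt, atomremove.txt, and .xyz files.

```shell
workdir=$(mktemp -d) || exit

# Hypothetical inputs: change line 3, remove lines 5 and 4.
printf '%s\n' 3 > "$workdir/change.txt"
printf '%s\n' 5 4 > "$workdir/remove.txt"
printf '%s\n' 'C 1' 'C 2' 'C 3' 'C 4' 'C 5' 'C 6' > "$workdir/input.txt"

# Append a sed command to every line number, as in the answer.
{ sed 's%$%s/C/H/%' "$workdir/change.txt"
  sed 's%$%d%' "$workdir/remove.txt"; } > "$workdir/script.sed"

script=$(cat "$workdir/script.sed")

# One pass over the input; all addresses refer to original line numbers.
result=$(sed -f "$workdir/script.sed" "$workdir/input.txt")

printf '%s\n' "$script" '---' "$result"
rm -rf "$workdir"
```

Because sed applies the whole script in a single pass, the order of the line numbers in the list files does not matter, unlike with repeated `sed -i` deletions, where every deletion shifts the lines below it.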

Your question doesn't really explain how the input files relate to the output files so I had to guess a bit. One of the important changes is to avoid sed -i when you are not modifying an existing file; but above all, definitely avoid repeatedly overwriting the same file with sed -i.

tripleee
  • You don't explain why you add +2 to all the line numbers; if this is really necessary, try `awk '{ print 2 + $1 ( NR == FNR ? "s/C/H/" : "d") }' extract_internal/atomchange.txt extract_internal/atomremove.txt >"$tmp"` instead of the parenthesized `sed` scripts. – tripleee Jul 18 '21 at 11:40
  • Actually I don't think the lack of quoting explains your problem. In fact, your problem seems unreproducible with the information you provided. – tripleee Jul 18 '21 at 11:45
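The `awk` variant from the first comment can be tried on sample data as well. This is a sketch with the one-liner spelled out in separate statements; the directory `workdir` and the file names are hypothetical, and the offset of 2 presumably skips the atom-count and comment lines that begin the .xyz file, as the sample input in the question suggests.

```shell
workdir=$(mktemp -d) || exit

# Hypothetical inputs: change atom 3, remove atoms 5 and 4.
printf '%s\n' 3 > "$workdir/change.txt"
printf '%s\n' 5 4 > "$workdir/remove.txt"

# First file (NR == FNR): emit substitutions; second file: emit deletions.
# Each number is shifted by 2 before the sed command is appended.
script=$(awk '{ n = $1 + 2
                cmd = (NR == FNR ? "s/C/H/" : "d")
                print n cmd }' "$workdir/change.txt" "$workdir/remove.txt")

printf '%s\n' "$script"
rm -rf "$workdir"
```

The `NR == FNR` test only distinguishes the two files when the first one is non-empty, so an empty change list would need separate handling.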