How do you select a certain part of a grep output?

Question

I am trying to substitute coordinates of a particular line in one file for the coordinates of a different file. Both files have a line containing "code word", and that is where the coordinates are found. The coordinates are also in the same set of columns, 33-54, if that helps. How can I store that part of the line of interest in a variable so I can use sed to substitute it? This is what I have so far:

#!/bin/bash 
FILE=$1 
grep -i "ABC DEF" $FILE.pdb 

# Somehow select the coordinates in the line with "ABC DEF" in $FILE.pdb and label it PDBcoords
PDBcoords=$unknownfunction1

# Somehow select the coordinates in the line with "ABC DEF" in reference.pdb and label it refcoords
grep -i "ABC DEF" reference.pdb
refcoords=$unknownfunction2

sed -i 's/$refcoords/$PDBcoords/' 
wait
echo "Whole Command Done for $FILE"

The grep output looks like this:

ATOM   5103  ABC DEF A 100       5.817   2.502 -21.483  1.00 13.63           O

and I only want to select the coordinates

5.817   2.502 -21.483

However, these coordinates change for every file, so I need to label these columns as a variable. Same goes for the reference pdb.

EDIT I came up with this solution:

#!/bin/bash
FILE=$1
PDB=$(grep -i "OXT ORN" $FILE.pdb | cut -c 33-54)
PDBcoords="$(echo "$PDB")"
echo $PDBcoords
echo Found PDB Coordinates for $FILE
pkaSH=$(grep -i "OXT  ORN" pkaSH.pdb | cut -c 33-54)
pkaSHcoords="$(echo "$pkaSH")"
echo $pkaSHcoords
echo Found pkaSH Coordinates for $FILE
sed -i "s/$pkaSHcoords/$PDBcoords/" pkaSH.pdb
echo Command Done

My idea was to redirect the grep output to a temporary file, cut out the coordinate columns, and then define that as a variable with spaces preserved. I'm sure this was overcomplicated, but since it works I think I have my answer.
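On the worry about losing the repeated spaces: the following is a minimal sketch, using the sample coordinates shown earlier as an assumed value, of what quoting actually changes. The variable names `unquoted`, `quoted`, and `copy` are made up for illustration.

```shell
#!/bin/bash
# Assumed sample value: the 24-character coordinate slot, leading spaces included.
coords='   5.817   2.502 -21.483'

unquoted=$(echo $coords)     # unquoted: word splitting collapses each run of spaces
quoted=$(echo "$coords")     # quoted: the value is passed through unchanged
copy=$coords                 # a plain assignment also keeps it; no echo is needed

printf '[%s]\n' "$unquoted" "$quoted" "$copy"
```

This prints `[5.817 2.502 -21.483]` for the unquoted case and `[   5.817   2.502 -21.483]` for the other two, so the spaces are only lost at the point where a variable is expanded without double quotes.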

you mention the coordinates will always be in the same columns (33-54), but the sample row you've provided (assuming `ATOM` starts in column #1) shows the coordinates are between columns 36 and 56 (inclusive); could you explain the differences and update the question accordingly? also, will `$FILE.pdb` and `reference.pdb` have coordinates located in the same exact column range? lastly, your `sed` command is missing a filename ... what file are you running the `sed` against? — markp-fuso, Dec 19 '20 at 04:07
sorry about that, what throws it off is the code name, because the code name is not actually that long. I'll change it to something of the right length and I'll fix the sed. There is a space before the coordinate in the example because there may be a negative sign. — sweetandtangy, Dec 19 '20 at 05:08
the key take-away is that this is a fixed-width file, right? my answer makes some assumptions about which columns are the 'correct' column positions ... it should be easy enough for you to modify the proposed code to use your actual column positions — markp-fuso, Dec 19 '20 at 05:12
Yes it is fixed-width. I came up with something, although it may be a bit convoluted, but it works. I use the grep and cut commands together but I used some temporary files. I will update the question with what I have. — sweetandtangy, Dec 19 '20 at 05:30
fwiw, `PDB` and `PDBcoords` have the same value (likewise for `ref` and `refcoords`) - `PDBcoords=$(echo "$PDB")` is the same as `PDBcoords="${PDB}"` - so unless there is some follow-on processing that needs to manipulate 2 different copies of the coordinates, the dual variables aren't needed; `wait` is used to wait for the completion of a process that was started in the background, but since there is no background processing there's nothing to 'wait' for; net result is that the `wait` calls are not needed — markp-fuso, Dec 19 '20 at 15:35
I was just afraid the multiple spaces wouldn’t be preserved so I used the double quotes of PDB to make sure — sweetandtangy, Dec 19 '20 at 15:40

score 2 · Answer 1 · answered Dec 19 '20 at 04:05

Another option is:

tr -s ' ' | cut -d ' ' -f 7-9

Where tr -s is used to compress all multiple spaces into a single space and then cut -d ' ' -f 7-9 outputs the space delimited 7th-9th fields, e.g.

$ echo "ATOM   5103  code name A 100       5.817   2.502 -21.483  1.00 13.63           O" | 
tr -s ' ' | cut -d ' ' -f 7-9
5.817 2.502 -21.483

answered Dec 19 '20 at 04:05

David C. Rankin


I'm not sure if this can help me because the coordinates have multiple spaces in the file so I need the spaces to be preserved when I use sed to replace them. – sweetandtangy Dec 19 '20 at 05:14

markp-fuso · Accepted Answer · 2020-12-19T05:34:35.217

Assumptions/Understandings ...

OP has mentioned the coordinates are always in columns 33-54 (ie, data is in a fixed-width format as opposed to some sort of delimited format)
the sample data shows the coordinates are in columns 36-56 (inclusive)
for the sake of this answer I'm going to assume the coordinates reside in columns 33-56 (inclusive; total of 24 columns); this will allow me to use the sample data
assuming various non-coordinate columns may have embedded spaces (eg, code word)
assuming the search pattern (eg, code name) will only match a single row in each file ($FILE.pdb and reference.pdb)

Sample data (in place of $FILE.pdb I'm using codeword.pdb):

$ cat codeword.pdb
ATOM   5103  something else       23.219  12.880 -78.003  1.00 13.63           O
ATOM   5103  code name A 100       5.817   2.502 -21.483  1.00 13.63           O
ATOM   5103  not this line buddy 105.199 342.192  -1.423  1.00 13.63           O

One idea using grep and cut:

ptn="code name"

grep -i "${ptn}" codeword.pdb | cut -c33-56

This generates:

   5.817   2.502 -21.483

Capturing the output to a variable:

PDBcoords="$(grep -i "${ptn}" codeword.pdb | cut -c33-56)"

echo ".${PDBcoords}."                  # decimals are added as visual delimiters
echo "${#PDBcoords}"                   # number of characters in variable

This generates:

.   5.817   2.502 -21.483.
24

NOTES:

the output does contain some leading spaces; for now I'm assuming this is fine, since keeping them ensures the whole of columns 33-56 is replaced even when a replacement string is wider (assuming, of course, that the coordinates span the same columns in all files)
OP should be able to use the same code to pull coordinates from reference.pdb for storage in the $refcoords variable
OP can change the numbers in this code to match the actual column positions (and widths) for both files $FILE.pdb and reference.pdb
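As a minimal sketch of the notes above, the same `grep | cut` pipeline can be wrapped in a small function and pointed at each file in turn. The function name `get_coords` and the temporary sample files are assumptions made for illustration; the column range 33-56 is the one assumed in this answer and should be adjusted to the real layout.

```shell
#!/bin/bash
# Hypothetical helper: print columns 33-56 of the first line matching a pattern.
get_coords() {
    local pattern=$1 file=$2
    grep -i -m1 -- "$pattern" "$file" | cut -c33-56
}

# Throwaway sample files so the sketch is self-contained.
workdir=$(mktemp -d)
printf '%s\n' 'ATOM   5103  code name A 100       5.817   2.502 -21.483  1.00 13.63           O' > "$workdir/codeword.pdb"
printf '%s\n' 'ATOM   5103  code name A 100     103.227  23.285  -1.223  1.00 13.63           O' > "$workdir/reference.pdb"

PDBcoords=$(get_coords "code name" "$workdir/codeword.pdb")
refcoords=$(get_coords "code name" "$workdir/reference.pdb")

# Quote the expansions so the embedded runs of spaces survive.
printf '[%s]\n[%s]\n' "$PDBcoords" "$refcoords"
rm -rf "$workdir"
```

With these sample files the two printed values are `[   5.817   2.502 -21.483]` and `[ 103.227  23.285  -1.223]`, both 24 characters wide.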

As for the sed portion of OP's code ...

at the time I wrote up this answer the sed command was incomplete (I'm assuming the sed target is $FILE.pdb)
assuming there could be multiple lines with the same coordinates, we'll need to match on both code name and $PDBcoords

One sed idea:

ptn="Code NAME"                          # mix it up, show case insensitivity
PDBcoords="   5.817   2.502 -21.483"
refcoords=" 103.227  23.285  -1.223"

sed "/${ptn}/Is/${PDBcoords}/${refcoords}/" codeword.pdb

Where:

/I - perform a case-insensitive match on the address pattern (a GNU sed extension)
s/ .... / .... / - replace old coordinates with new coordinates (assumes the 2 variables (PDBcoords and refcoords) are of the same length in order to maintain column positions in the output)

This generates:

############## before image for sake of comparison:

ATOM   5103  something else       23.219  12.880 -78.003  1.00 13.63           O
ATOM   5103  code name A 100       5.817   2.502 -21.483  1.00 13.63           O
ATOM   5103  not this line buddy 105.199 342.192  -1.423  1.00 13.63           O

############## results of the `sed` command:

ATOM   5103  something else       23.219  12.880 -78.003  1.00 13.63           O
ATOM   5103  code name A 100     103.227  23.285  -1.223  1.00 13.63           O
ATOM   5103  not this line buddy 105.199 342.192  -1.423  1.00 13.63           O

NOTE: Once OP has confirmed this performs the desired modification the -i flag can be added to the sed command to allow for in place updating of $FILE.pdb.
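One caveat with the sed approach is that the old coordinates are used as a regular expression, so each `.` in them matches any character. That is unlikely to cause trouble here, but since the file is fixed-width, an alternative is to overwrite the columns by position. The following is a rough sketch using awk rather than sed, assuming columns 33-56 as above, a lowercase search string, and a temporary sample file.

```shell
#!/bin/bash
# Sketch: overwrite columns 33-56 on the matching line by position, not by pattern.
workdir=$(mktemp -d)
target="$workdir/codeword.pdb"
cat > "$target" <<'EOF'
ATOM   5103  something else       23.219  12.880 -78.003  1.00 13.63           O
ATOM   5103  code name A 100       5.817   2.502 -21.483  1.00 13.63           O
EOF

refcoords=" 103.227  23.285  -1.223"    # 24 characters, same width as the slot

result=$(awk -v ptn="code name" -v new="$refcoords" '
    index(tolower($0), ptn) {                      # literal match; ptn must be lowercase
        $0 = substr($0, 1, 32) new substr($0, 57)  # keep cols 1-32 and 57 onward
    }
    { print }
' "$target")

printf '%s\n' "$result"
rm -rf "$workdir"
```

The first line is printed unchanged and the second comes out with the new coordinates in columns 33-56. To update a file this way, the output would need to be written to a new file and moved over the original, since awk has no portable in-place flag.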

score 0 · Answer 3 · answered Dec 19 '20 at 03:57

I don't know if all files have the same type of "columns", but if so awk might be what you need

echo "ATOM   5103  code name A 100       5.817   2.502 -21.483  1.00 13.63           O" | awk '{ print $7, $8, $9 }'

# outputs: 5.817 2.502 -21.483

answered Dec 19 '20 at 03:57

Julien B.


LC-datascientist · Answer 4 · 2020-12-19T04:13:17.000

You can use awk to select columns

grep -i "code name" reference.pdb | awk '{print $7,$8,$9}'

or use cut

grep -i "code name" reference.pdb | tr -s " " | cut -d" " -f 7-9

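In both commands, you extract the seventh, eighth, and ninth whitespace-delimited fields. Note that the field numbers depend on how many words come before the coordinates on the line, and that the output separates the values with single spaces.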
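Because the name column can contain a varying number of words, the field numbers are not stable across lines. As a small illustration, assuming the coordinates sit in columns 33-56 of a fixed-width line, `substr` in awk can select by character position instead. The variable names and the sample line are made up for this sketch.

```shell
#!/bin/bash
# Assumed sample line with a two-word name and no "A 100" part before the coordinates.
line='ATOM   5103  something else       23.219  12.880 -78.003  1.00 13.63           O'

# Selecting by field: the coordinates are fields 5-7 here, so $7-$9 is off by two.
by_field=$(printf '%s\n' "$line" | awk '{ print $7, $8, $9 }')

# Selecting by position: 24 characters starting at column 33, spacing preserved.
by_column=$(printf '%s\n' "$line" | awk '{ print substr($0, 33, 24) }')

printf 'fields : [%s]\ncolumns: [%s]\n' "$by_field" "$by_column"
```

For this line the field-based version prints `[-78.003 1.00 13.63]`, while the position-based version prints `[  23.219  12.880 -78.003]`.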

Edit

Reference: How to specify more spaces for the delimiter using cut?

edited Dec 19 '20 at 04:13

answered Dec 19 '20 at 03:59

LC-datascientist

