3

I have a text file as shown below

ATOM    920  CA  GLN A 203      39.292 -13.354  17.416  1.00 55.76           C 
ATOM    929  CA  HIS A 204      38.546 -15.963  14.792  1.00 29.53           C
ATOM    939  CA  ASN A 205      39.443 -17.018  11.206  1.00 54.49           C  
ATOM    947  CA  GLU A 206      41.454 -13.901  10.155  1.00 26.32           C
ATOM    956  CA  VAL A 207      43.664 -14.041  13.279  1.00 40.65           C 
.
.
.

ATOM    963  CA  GLU A 208      45.403 -17.443  13.188  1.00 40.25           C  

I would like to calculate the distance between two alpha carbon atoms i.e calculate the distance between first and second atom and then between second and third atom and so on..... The distance between two atoms can be expressed as:distance = sqrt((x1-x2)^2+(y1-y2)^2+(z1-z2)^2) .

The columns 7,8 and 9 represents x,y and z co-ordinates respectively.I need to print the distance and the corresponding residue pairs(column 4) as shown below.(the values of distance are not real)

GLN-HIS   4.5
HIS-ASN   3.2
ASN-GLU   2.5

How can I do this calculation with perl or python?

user1866262
  • 61
  • 1
  • 3
  • I have the very same problem, But I'nm working with python, are there similar solution there as well? – Prelude Feb 14 '14 at 15:18

6 Answers6

6

Do NOT split on whitespace

The other answers given here make a flawed assumption - that coordinates will be space-delimited. Per the PDB specification of ATOM, this is not necessarilly the case: PDB record values are specified by column indices, and may flow into one another. For instance, your first ATOM record reads:

ATOM    920  CA  GLN A 203      39.292 -13.354  17.416  1.00 55.76           C 

But this is perfectly valid as well:

ATOM    920  CA  GLN A 203      39.292-13.3540  17.416  1.00 55.76           C 

The better approach

Because of the column-specified indices, and the number of other problems that can occur in a PDB file, you should not write your own parser. The PDB format is messy, and there's a lot of special cases and badly formatted files to handle. Instead, use a parser that's already written for you.

I like Biopython's PDB.PDBParser. It will parse the structure for you into Python objects, complete with handy features. If you prefer Perl, check out BioPerl.

PDB.Residue objects allow keyed access to Atoms by name, and PDB.Atom objects overload the - operator to return distance between two Atoms. We can use this to write clean, concise code:

Code

from Bio import PDB
parser = PDB.PDBParser()

# Parse the structure into a PDB.Structure object
pdb_code = "1exm"
pdb_path = "pdb1exm.ent"
struct = parser.get_structure(pdb_code, pdb_path)

# Grab the first two residues from the structure
residues = struct.get_residues()
res_one = residues.next()
res_two = residues.next()

try:
    alpha_dist = res_one['CA'] - res_two['CA']
except KeyError:
    print "Alpha carbon missing, computing distance impossible!"
David Cain
  • 16,484
  • 14
  • 65
  • 75
5

If your data is separated by whitespace, a simple split can do the job. Buffering the lines to compare them to each other sequentially.

use strict;
use warnings;

my @line;
while (<>) {
    push @line, $_;            # add line to buffer
    next if @line < 2;         # skip unless buffer is full
    print proc(@line), "\n";   # process and print 
    shift @line;               # remove used line 
}

sub proc {
    my @a = split ' ', shift;   # line 1
    my @b = split ' ', shift;   # line 2
    my $x = ($a[6]-$b[6]);      # calculate the diffs
    my $y = ($a[7]-$b[7]);
    my $z = ($a[8]-$b[8]);
    my $dist = sprintf "%.1f",                # format the number
                   sqrt($x**2+$y**2+$z**2);   # do the calculation
    return "$a[3]-$b[3]\t$dist"; # return the string for printing
}

Output (with the sample data):

GLN-HIS 3.8
HIS-ASN 3.8
ASN-GLU 3.9
GLU-VAL 3.8

If your data is tab separated, you can split on /\t/ instead of ' '.

TLP
  • 66,756
  • 10
  • 92
  • 149
  • Thank you very much for your answer. Your code works well. But I don't get the output line-by-line. I would like to change the appearance of my output like your sample output.How can I change? I appreciate your help!! – user1866262 Nov 30 '12 at 14:53
  • Oh right, forgot that.. I ran the script with the `-l` flag while testing. `perl -l script.pl input.txt`. I'll fix it. – TLP Nov 30 '12 at 15:01
  • An important note: PDB atom coordinates are **not necessarily whitespace-delimited** (see the [specification](http://deposit.rcsb.org/adit/docs/pdb_atom_format.html#ATOM)). – David Cain May 11 '13 at 16:16
  • Instead of calculating distance from one CA to the next one and so on.. how we can modify above answer to calculate distance between one CA to all other CAs and then distance from next CA to rest of the CAs and so on...? – pradeep pant Dec 15 '16 at 11:05
4

Assuming your data are in "atoms.txt", this reads it in line by line and splits the entries into a list:

import csv

with open("atoms.txt") as f:
  reader = csv.reader(f)
  for line, in reader:
      entries = line.split()
      print entries

Now for each list extract the columns you need, and calculate the distances etc (Bear in mind that the lists in python are zero-based).

ev-br
  • 24,968
  • 9
  • 65
  • 78
  • Did you test this code? Never mind the mistake with `for line, in reader:` - the return of that will be a `list` which won't have a `split` method... – Jon Clements Nov 30 '12 at 13:11
  • Exactly what I said the first time... You'll get a ValueError for `for line, in reader`, and once that's fixed - you'll get a `AttributeError: 'list' object has no attribute 'split'` on `entries = line.split()` – Jon Clements Nov 30 '12 at 13:24
  • No, I don't get that. Copy-pasting the first five lines from the text file in the question, and running the code from my answer prints a list, as desired. (python 2.6 here, don't know about python 3 --- does `csv.reader` differ?) – ev-br Nov 30 '12 at 13:31
  • Try putting the first five lines in `atoms.txt` and running the code exactly as you've posted... – Jon Clements Nov 30 '12 at 13:33
  • Like I was saying, that's exactly what I do, and that prints the list of tokens. – ev-br Nov 30 '12 at 13:35
  • `for line, in reader:` returns the a string, not list. Removing the comma (`for line in reader:`) returns a list with one element, hence unpacking. – ev-br Nov 30 '12 at 13:36
  • Okay - it does work, just baulks on blank lines... Umm, if you're going to use `csv.reader` then specify a `delimiter=' '` and you don't have to split it, or just don't use the reader...? – Jon Clements Nov 30 '12 at 13:44
  • Ah, good, so it does work, after all :-). Regarding blank lines --- I don't see any mention of blank lines in the OP. – ev-br Nov 30 '12 at 14:01
3

You should ideally use the MDAnalysis package for a pythonic way of "slicing" atoms and segments and calculating distance measures among them. In fact, MDAnalysis supports several MD simulation and chemical structure formats.

For a little more verbose example, see also the following entry on Biostars.org.

Paul Rigor
  • 986
  • 1
  • 12
  • 23
1

If you are interested in just one pair, bash works just fine. This is a script I use, I have it set to relaunch at the end (turn this off if you wish). It will prompt you for which atom. PDB files can have different column set up, so for the awk line make sure that the columns match up. Do a test case by hand before using with a new pdb file. This is trivial, but change in the script my pdb file to yours.

#!/usr/bin/env bash

echo " "
echo "Use one letter abbreviations. Case doesn't matter." 
echo "Example: A 17 CA or n 162 cg"

echo " - - - - - - - - - - - - - - - - - -"
#------------First Selection------------

read -e -p "What first atom? " sel1

# echo $sel1
sel1caps=${sel1^^}
# echo "sel1caps="$sel1caps

arr1=($sel1caps)
# echo "arr1[0]= "${arr1[0]}
# echo "arr1[1]= "${arr1[1]}
# echo "arr1[2]= "${arr1[2]}

#To convert one to three letters

if [ ${arr1[0]} = A ] ; then
    SF1=ALA
elif [ ${arr1[0]} = H ] ; then
    SF1=HIS
elif [ ${arr1[0]} = R ] ; then
    SF1=ARG
elif [ ${arr1[0]} = K ] ; then
    SF1=LYS
elif [ ${arr1[0]} = I ] ; then
    SF1=ILE
elif [ ${arr1[0]} = F ] ; then
    SF1=PHE 
elif [ ${arr1[0]} = L ] ; then
    SF1=LEU
elif [ ${arr1[0]} = W ] ; then
    SF1=TRP
elif [ ${arr1[0]} = M ] ; then
    SF1=MET
elif [ ${arr1[0]} = P ] ; then
    SF1=PRO 
elif [ ${arr1[0]} = C ] ; then
    SF1=CYS 
elif [ ${arr1[0]} = N ] ; then
    SF1=ASN
elif [ ${arr1[0]} = V ] ; then
    SF1=VAL
elif [ ${arr1[0]} = G ] ; then
    SF1=GLY 
elif [ ${arr1[0]} = S ] ; then
    SF1=SER 
elif [ ${arr1[0]} = Q ] ; then
    SF1=GLN 
elif [ ${arr1[0]} = Y ] ; then
    SF1=TYR 
elif [ ${arr1[0]} = D ] ; then
    SF1=ASP
elif [ ${arr1[0]} = E ] ; then
    SF1=GLU 
elif [ ${arr1[0]} = T ] ; then
    SF1=THR 
else
    echo "use one letter codes"
    echo "exiting"
    exit
fi

# echo "SF1 ="$SF1

#If there is nothing printing for line 1, check the expression for your pdb file. The spaces may need adjustment at the end.
line1=$(grep -E "${arr1[2]} *?${SF1}(.*?) ${arr1[1]}     " 1A23.pdb)
# echo $line1

ar_l1=($line1)
# echo "ar_l1[1]="${ar_l1[1]}

echo " - - - - - - - - - - - - - - - - - -"
#------------Second Selection------------

read -e -p "What second atom? " sel2

# echo $sel2

sel2caps=${sel2^^}
# echo "sel2caps="$sel2caps

arr2=($sel2caps)
# echo "arr2[0]= "${arr2[0]}
# echo "arr2[1]= "${arr2[1]}
# echo "arr2[2]= "${arr2[2]}

#To convert one to three letters

if [ ${arr2[0]} = A ] ; then
    SF2=ALA
elif [ ${arr2[0]} = H ] ; then
    SF2=HIS
elif [ ${arr2[0]} = R ] ; then
    SF2=ARG
elif [ ${arr2[0]} = K ] ; then
    SF2=LYS
elif [ ${arr2[0]} = I ] ; then
    SF2=ILE
elif [ ${arr2[0]} = F ] ; then
    SF2=PHE 
elif [ ${arr2[0]} = L ] ; then
    SF2=LEU
elif [ ${arr2[0]} = W ] ; then
    SF2=TRP
elif [ ${arr2[0]} = M ] ; then
    SF2=MET
elif [ ${arr2[0]} = P ] ; then
    SF2=PRO 
elif [ ${arr2[0]} = C ] ; then
    SF2=CYS 
elif [ ${arr2[0]} = N ] ; then
    SF2=ASN
elif [ ${arr2[0]} = V ] ; then
    SF2=VAL
elif [ ${arr2[0]} = G ] ; then
    SF2=GLY 
elif [ ${arr2[0]} = S ] ; then
    SF2=SER 
elif [ ${arr2[0]} = Q ] ; then
    SF2=GLN 
elif [ ${arr2[0]} = Y ] ; then
    SF2=TYR 
elif [ ${arr2[0]} = D ] ; then
    SF2=ASP
elif [ ${arr2[0]} = E ] ; then
    SF2=GLU 
elif [ ${arr2[0]} = T ] ; then
    SF2=THR 
else
    echo "use one letter codes"
    echo "exiting"
    exit
fi

# echo "SF2 ="$SF2


line2=$(grep -E "${arr2[2]} *?${SF2}(.*?) ${arr2[1]}     " 1A23.pdb)
# echo $line2

ar_l2=($line2)
# echo "ar_l2[1]="${ar_l2[1]}
# echo "ar_l2[1]="${ar_l2[1]}

atom1=${ar_l1[1]}
atom2=${ar_l2[1]}
echo "==========================="
echo ${arr1[0]}${arr1[1]}${arr1[2]}" to "${arr2[0]}${arr2[1]}${arr2[2]}":"
# 6, 7, 8 are column numbers in the pdb file. 
# If there are multiple molecules it should be 7, 8, 9.
awk '$2=='$atom1'{x1=$7;y1=$8;z1=$9}                                 # get the ATOM 1
     $2=='$atom2'{x2=$7;y2=$8;z2=$9}                               # get the ATOM 2
     END{print sqrt((x1-x2)*(x1-x2) + (y1-y2)*(y1-y2) + (z1-z2)*(z1-z2))}' 1A23.pdb    # calculate the distance.

echo "Angstroms"
echo "==========================="
echo " "
echo "-_-_-_-_Running Script Again_-_-_-_-"
./distance_soln.sh
PhysicalChemist
  • 540
  • 4
  • 14
0

A simple Python code can do the job. I have assumed that all your contents are in file "input.txt".

def process(line1, line2):
    content1 = line1.split()
    content2 = line2.split()
    x1, y1, z1 = float(content1[6]), float(content1[7]), float(content1[8])
    x2, y2, z2 = float(content2[6]), float(content2[7]), float(content2[8])
    distance = math.sqrt(math.pow(x1-x2, 2) + math.pow(y1-y2, 2) + math.pow(z1-z2, 2))
    return content1[3]+"-"+content2[3]+" "+ "{0:.2f}".format(distance)

with open("input.txt") as f:
    line1 = f.readline()
    for line in f:
        line2 = line
        print(process(line1, line2))
        line1 = line2

Please do let me know if you find any discrepancies or issue in using this script.