Python selecting items by comparing values in a table using dictionary

Question

I have a table with 12 columns and want to select the items in the first column (qseqid) based on the second column (sseqid). Meaning that the second column (sseqid) is repeating with different values in the 11th and 12th columns, which areevalueandbitscore, respectively. The ones that I would like to get are having the lowestevalueand the highestbitscore(whenevalues are the same, the rest of the columns can be ignored and the data is down below).

So, I have made a short code which uses the second columns as a key for the dictionary. I can get five different items from the second column with lists of qseqid+evalueandqseqid+bitscore.

Here is the code:

#!usr/bin/python

filename = "data.txt"

readfile = open(filename,"r")

d = dict()

for i in readfile.readlines():
    i = i.strip()
    i = i.split("\t")
    d.setdefault(i[1], []).append([i[0],i[10]])
    d.setdefault(i[1], []).append([i[0],i[11]])

for x in d:
    print(x,d[x])

readfile.close()

But, I am struggling to get the qseqid with the lowest evalue and the highest bitscore for each sseqid. Is there any good logic to solve the problem?

Thedata.txtfile (including the header row and with»representing tab characters)

qseqid»sseqid»pident»length»mismatch»gapopen»qstart»qend»sstart»send»evalue»bitscore
ACLA_022040»TBB»32.71»431»258»8»39»468»24»423»2.00E-76»240
ACLA_024600»TBB»80»435»87»0»1»435»1»435»0»729
ACLA_031860»TBB»39.74»453»251»3»1»447»1»437»1.00E-121»357
ACLA_046030»TBB»75.81»434»105»0»1»434»1»434»0»704
ACLA_072490»TBB»41.7»446»245»3»4»447»3»435»2.00E-120»353
ACLA_010400»EF1A»27.31»249»127»8»69»286»9»234»3.00E-13»61.6
ACLA_015630»EF1A»22»491»255»17»186»602»3»439»8.00E-19»78.2
ACLA_016510»EF1A»26.23»122»61»4»21»127»9»116»2.00E-08»46.2
ACLA_023300»EF1A»29.31»447»249»12»48»437»3»439»2.00E-45»155
ACLA_028450»EF1A»85.55»443»63»1»1»443»1»442»0»801
ACLA_074730»CALM»23.13»147»101»4»6»143»2»145»7.00E-08»41.2
ACLA_096170»CALM»29.33»150»96»4»34»179»2»145»1.00E-13»55.1
ACLA_016630»CALM»23.9»159»106»5»58»216»4»147»5.00E-12»51.2
ACLA_031930»RPB2»36.87»1226»633»24»121»1237»26»1219»0»734
ACLA_065630»RPB2»65.79»1257»386»14»1»1252»4»1221»0»1691
ACLA_082370»RPB2»27.69»1228»667»37»31»1132»35»1167»7.00E-110»365
ACLA_061960»ACT»28.57»147»95»5»146»284»69»213»3.00E-12»57.4
ACLA_068200»ACT»28.73»463»231»13»16»471»4»374»1.00E-53»176
ACLA_069960»ACT»24.11»141»97»4»581»718»242»375»9.00E-09»46.2
ACLA_095800»ACT»91.73»375»31»0»1»375»1»375»0»732

And here's a little more readable version of the table's contents:

0            1           2      3        4        5      6    7      8    9        10       11
qseqid       sseqid pident length mismatch  gapopen qstart qend sstart send    evalue bitscore
ACLA_022040  TBB     32.71    431      258        8    39   468     24  423  2.00E-76      240
ACLA_024600  TBB        80    435       87        0     1   435      1  435         0      729
ACLA_031860  TBB     39.74    453      251        3     1   447      1  437 1.00E-121      357
ACLA_046030  TBB     75.81    434      105        0     1   434      1  434         0      704
ACLA_072490  TBB      41.7    446      245        3     4   447      3  435 2.00E-120      353
ACLA_010400  EF1A    27.31    249      127        8    69   286      9  234  3.00E-13     61.6
ACLA_015630  EF1A       22    491      255       17   186   602      3  439  8.00E-19     78.2
ACLA_016510  EF1A    26.23    122       61        4    21   127      9  116  2.00E-08     46.2
ACLA_023300  EF1A    29.31    447      249       12    48   437      3  439  2.00E-45      155
ACLA_028450  EF1A    85.55    443       63        1     1   443      1  442         0      801
ACLA_074730  CALM    23.13    147      101        4     6   143      2  145  7.00E-08     41.2
ACLA_096170  CALM    29.33    150       96        4    34   179      2  145  1.00E-13     55.1
ACLA_016630  CALM     23.9    159      106        5    58   216      4  147  5.00E-12     51.2
ACLA_031930  RPB2    36.87   1226      633       24   121  1237     26 1219         0      734
ACLA_065630  RPB2    65.79   1257      386       14     1  1252      4 1221         0     1691
ACLA_082370  RPB2    27.69   1228      667       37    31  1132     35 1167 7.00E-110      365
ACLA_061960  ACT     28.57    147       95        5   146   284     69  213  3.00E-12     57.4
ACLA_068200  ACT     28.73    463      231       13    16   471      4  374  1.00E-53      176
ACLA_069960  ACT     24.11    141       97        4   581   718    242  375  9.00E-09     46.2
ACLA_095800  ACT     91.73    375       31        0     1   375      1  375         0      732

What is the expected result? – thefourtheye Jan 10 '15 at 04:08 — thefourtheye, Jan 10 '15 at 04:08

DSM · Answer 1 · 2015-01-10T13:40:28.293

Since you're a Python newbie I'm glad that there are several examples of how to this manually, but for comparison I'll show how it can be done using the pandas library which makes working with tabular data much simpler.

Since you didn't provide example output, I'm assuming that by "with the lowest evalue and the highest bitscore for each sseqid" you mean "the highest bitscore among the lowest evalues" for a given sseqid; if you want those separately, that's trivial too.

import pandas as pd
df = pd.read_csv("acla1.dat", sep="\t")
df = df.sort(["evalue", "bitscore"],ascending=[True, False])
df_new = df.groupby("sseqid", as_index=False).first()

which produces

>>> df_new
  sseqid       qseqid  pident  length  mismatch  gapopen  qstart  qend  sstart  send        evalue  bitscore
0    ACT  ACLA_095800   91.73     375        31        0       1   375       1   375  0.000000e+00     732.0
1   CALM  ACLA_096170   29.33     150        96        4      34   179       2   145  1.000000e-13      55.1
2   EF1A  ACLA_028450   85.55     443        63        1       1   443       1   442  0.000000e+00     801.0
3   RPB2  ACLA_065630   65.79    1257       386       14       1  1252       4  1221  0.000000e+00    1691.0
4    TBB  ACLA_024600   80.00     435        87        0       1   435       1   435  0.000000e+00     729.0

Basically, first we read the data file into an object called a DataFrame, which is kind of like an Excel worksheet. Then we sort by evalue ascending (so that lower evalues come first) and by bitscore descending (so that higher bitscores come first). Then we can use groupby to collect the data in groups of equal sseqid, and take the first one in each group, which because of the sorting will be the one we want.

This is a really good solution, but pandas module requires many other modules which are needed to be installed first. I can use Python on Linux, but unfortunately, I am currently working with Windows version, stuck on installing Six module. Anyway, good thank you. — Karyo, Jan 10 '15 at 14:56

score 2 · Accepted Answer · answered Jan 10 '15 at 04:17

#!usr/bin/python
import csv

DATA = "data.txt"

class Sequence:
    def __init__(self, row):
        self.qseqid   =       row[0]
        self.sseqid   =       row[1]
        self.pident   = float(row[2])
        self.length   =   int(row[3])
        self.mismatch =   int(row[4])
        self.gapopen  =   int(row[5])
        self.qstart   =   int(row[6])
        self.qend     =   int(row[7])
        self.sstart   =   int(row[8])
        self.send     =   int(row[9])
        self.evalue   = float(row[10])
        self.bitscore = float(row[11])

    def __str__(self):
        return (
            "{qseqid}\t"
            "{sseqid}\t"
            "{pident}\t"
            "{length}\t"
            "{mismatch}\t"
            "{gapopen}\t"
            "{qstart}\t"
            "{qend}\t"
            "{sstart}\t"
            "{send}\t"
            "{evalue}\t"
            "{bitscore}"
        ).format(**self.__dict__)

def entries(fname, header_rows=1, dtype=list, **kwargs):
    with open(fname) as inf:
        incsv = csv.reader(inf, **kwargs)

        # skip header rows
        for i in range(header_rows):
            next(incsv)

        for row in incsv:
            yield dtype(row)

def main():
    bestseq = {}
    for seq in entries(DATA, dtype=Sequence, delimiter="\t"):
        # see if a sequence with the same sseqid already exists
        prev = bestseq.get(seq.sseqid, None)

        if (
            prev is None
            or seq.evalue < prev.evalue
            or (seq.evalue == prev.evalue and seq.bitscore > prev.bitscore)
        ):
            bestseq[seq.sseqid] = seq

    # display selected sequences
    keys = sorted(bestseq)
    for key in keys:
        print(bestseq[key])

if __name__ == "__main__":
    main()

which results in

ACLA_095800     ACT     91.73   375     31      0       1       375     1       375     0.0     732.0
ACLA_096170     CALM    29.33   150     96      4       34      179     2       145     1e-13   55.1
ACLA_028450     EF1A    85.55   443     63      1       1       443     1       442     0.0     801.0
ACLA_065630     RPB2    65.79   1257    386     14      1       1252    4       1221    0.0     1691.0
ACLA_024600     TBB     80.0    435     87      0       1       435     1       435     0.0     729.0

martineau · Answer 3 · 2015-01-10T20:08:11.390

While not nearly as elegant and concise as using thepandaslibrary, it's quite possible to do what you want without resorting to third-party modules. The following uses thecollections.defaultdictclass to facilitate creation of dictionaries of variable-length lists of records. The use of theAttrDictclass is optional, but it makes accessing the fields of each dictionary-based records easier and is less awkward-looking than the usualdict['fieldname']syntax otherwise required.

import csv
from collections import defaultdict, namedtuple
from itertools import imap
from operator import itemgetter

data_file_name = 'data.txt'
DELIMITER = '\t'
ssqeid_dict = defaultdict(list)

# from http://stackoverflow.com/a/1144405/355230
def multikeysort(items, columns):
    comparers = [((itemgetter(col[1:].strip()), -1) if col.startswith('-') else
                  (itemgetter(col.strip()),      1)) for col in columns]
    def comparer(left, right):
        for fn, mult in comparers:
            result = cmp(fn(left), fn(right))
            if result:
                return mult * result
        else:
            return 0
    return sorted(items, cmp=comparer)

# from http://stackoverflow.com/a/15109345/355230
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

with open(data_file_name, 'rb') as data_file:
    reader = csv.DictReader(data_file, delimiter=DELIMITER)
    format_spec = '\t'.join([('{%s}' % field) for field in reader.fieldnames])
    for rec in (AttrDict(r) for r in reader):
        # Convert the two sort fields to numeric values for proper ordering.
        rec.evalue, rec.bitscore = map(float, (rec.evalue, rec.bitscore))
        ssqeid_dict[rec.sseqid].append(rec)

for ssqeid in sorted(ssqeid_dict):
    # Sort each group of recs with same ssqeid. The first record after sorting
    # will be the one sought that has the lowest evalue and highest bitscore.
    selected = multikeysort(ssqeid_dict[ssqeid], ['evalue', '-bitscore'])[0]
    print format_spec.format(**selected)

Output (»represents tabs):

ACLA_095800»    ACT»    91.73»  375»    31»     0»      1»      375»    1»      375»    0.0»    732.0
ACLA_096170»    CALM»   29.33»  150»    96»     4»      34»     179»    2»      145»    1e-13»  55.1
ACLA_028450»    EF1A»   85.55»  443»    63»     1»      1»      443»    1»      442»    0.0»    801.0
ACLA_065630»    RPB2»   65.79»  1257»   386»    14»     1»      1252»   4»      1221»   0.0»    1691.0
ACLA_024600»    TBB»    80»     435»    87»     0»      1»      435»    1»      435»    0.0»    729.0

This is good, but the output does not show the qsequid. Thank you for fixing my question. — Karyo, Jan 10 '15 at 14:57
Thanks. Almost laughable that after all the reformatting of your question _I'm_ the one who didn't understand the results you wanted. Anyway, I updated my answer in that respect now that I understand what it is better. — martineau, Jan 10 '15 at 20:14

score 0 · Answer 4 · answered Jan 10 '15 at 04:40

filename = 'data.txt'

readfile = open(filename,'r')

d = dict()
sseqid=[]
lines=[]
for i in readfile.readlines():
    sseqid.append(i.rsplit()[1])
    lines.append(i.rsplit())

sorted_sseqid = sorted(set(sseqid))

sdqDict={}
key =None

for sorted_ssqd in sorted_sseqid:

    key=sorted_ssqd
    evalue=[]
    bitscore=[]
    qseid=[]
    for line in lines:
        if key in line:
            evalue.append(line[10])
            bitscore.append(line[11])
            qseid.append(line[0])
    sdqDict[key]=[qseid,evalue,bitscore]

print sdqDict

print 'TBB LOWEST EVALUE' + '---->' + min(sdqDict['TBB'][1])
##I think you can do the list manipulation below to find out the qseqid

readfile.close()

Python selecting items by comparing values in a table using dictionary

4 Answers4