I'm a Python noob. After a few hours of googling and searching Stack Overflow, I failed to find a solution to my problem:

I use an external script to read files containing information about molecule activities. Once read, the data will be in a list in the following form:

INACT67481 -10.84

That is, the name of the molecule and its activity value, separated by a single space. The length of the molecule name varies greatly.

Now, the trouble is, each molecule may have multiple (up to n) values, and only the highest should be preserved, while making sure the order is not changed (beyond removing the duplicates with smaller values).

With the help of threads such as this and this, I know how I could simply delete the duplicates, but am rather lost as to how I could only delete the one with the smallest value, without resorting to a horrible mess of loops.

EDIT: I can also rewrite the file-parsing script in Python, if having the data in a different form would prove easier.

EDIT: Sample data:
CHEMBL243059.smi 11.75
CHEMBL115092.smi 10.49
CHEMBL244771.smi 10.79
CHEMBL471221.smi 10.78
CHEMBL573301.smi 10.77
CHEMBL469583.smi 10.77
CHEMBL115092.smi 10.97
CHEMBL244771.smi 8.95
CHEMBL16781.smi 10.76
CHEMBL440776.smi 10.76
CHEMBL243059.smi 10.75
CHEMBL115092.smi 10.69

Should return:

CHEMBL243059.smi 11.75
CHEMBL244771.smi 10.79
CHEMBL471221.smi 10.78
CHEMBL573301.smi 10.77
CHEMBL469583.smi 10.77
CHEMBL115092.smi 10.97
CHEMBL16781.smi 10.76
CHEMBL440776.smi 10.76

Bohren
  • I think you're looking for the [unique_everseen](http://docs.python.org/2/library/itertools.html#recipes) recipe. *"List unique elements, preserving order. Remember all elements ever seen."* – Ashwini Chaudhary Jun 11 '13 at 10:09
  • Please post some sample data and examples. – Ashwini Chaudhary Jun 11 '13 at 10:17
  • Added sample data. As for the duplicate, it is close, hence the link to that very thread in the OP. However, it does not seem to deal with the possibility of the first encountered entry having a lower value than the second/third/etc. – Bohren Jun 11 '13 at 10:47
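
To illustrate the point in the last comment: a plain order-preserving de-duplication keeps whichever entry appears first, not the one with the highest activity. Below is a minimal sketch using a hypothetical helper, `first_seen`. It is not the actual `unique_everseen` recipe, just the same keep-the-first-occurrence idea keyed on the molecule name.

```python
def first_seen(lines):
    """Keep only the first line seen for each molecule name."""
    seen = set()
    kept = []
    for line in lines:
        # The molecule name is everything before the first space
        molecule = line.split(" ", 1)[0]
        if molecule not in seen:
            seen.add(molecule)
            kept.append(line)
    return kept


sample = [
    "CHEMBL115092.smi 10.49",
    "CHEMBL244771.smi 10.79",
    "CHEMBL115092.smi 10.97",
]
print(first_seen(sample))
# ['CHEMBL115092.smi 10.49', 'CHEMBL244771.smi 10.79']  (the higher 10.97 is lost)
```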

1 Answer

from collections import OrderedDict

D = OrderedDict()  # molecule -> highest activity seen so far

with open("fin.txt") as fin:
    for line in fin:
        if line.isspace():   # Guard against empty lines
            continue
        molecule, sep, activity = line.partition(" ")
        activity = float(activity)
        if molecule in D:
            if activity > D[molecule]:
                # Higher value found: update it and move the entry to the end
                D[molecule] = activity
                D.move_to_end(molecule)
        else:
            D[molecule] = activity
John La Rooy
  • While I currently have access to Python 3.2, other users who will be using it will only have 2.7, and unfortunately move_to_end is not available in it. – Bohren Jun 11 '13 at 10:45
  • @Bohren, no problem, you can `del D[molecule]` followed by `D[molecule] = activity` – John La Rooy Jun 11 '13 at 13:05
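
The Python 2.7 workaround from the last comment could be sketched roughly as follows. This is not the answerer's exact code: the function name `keep_highest` and the in-memory `lines` list are assumptions for illustration (the question says the data ends up in a list). Deleting a key and re-inserting it should have the same effect as `move_to_end`, because an `OrderedDict` places newly inserted keys at the end.

```python
from collections import OrderedDict


def keep_highest(lines):
    """Return (molecule, activity) pairs, keeping each molecule's highest value.

    A molecule sits at the position where its highest value was seen.
    """
    best = OrderedDict()
    for line in lines:
        if not line.strip():  # skip empty lines
            continue
        molecule, _, activity = line.strip().partition(" ")
        activity = float(activity)
        if molecule in best:
            if activity > best[molecule]:
                # No move_to_end in Python 2.7: delete, then re-insert at the end
                del best[molecule]
                best[molecule] = activity
        else:
            best[molecule] = activity
    return list(best.items())


sample = [
    "CHEMBL115092.smi 10.49",
    "CHEMBL244771.smi 10.79",
    "CHEMBL115092.smi 10.97",
]
print(keep_highest(sample))
# [('CHEMBL244771.smi', 10.79), ('CHEMBL115092.smi', 10.97)]
```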