
I have a script that takes a file containing a list of query IDs and extracts the organism and sequence for each from UniProt. The code works well, but it is very slow: I want to process approximately 4 million sequences through it, yet it takes around 5 minutes to parse just 100 sequences:

real    5m32.452s
user    0m0.651s
sys 0m0.135s

The code uses Python's requests module. I've read online that I can reuse the connection with requests.Session(), however when I try this I get the following error:

Traceback (most recent call last):
File "retrieve.py", line 14, in <module>
result = session.get(baseURL, payload)
TypeError: get() takes exactly 2 arguments (3 given)
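From what I can tell, Session.get() only accepts the URL as a positional argument, so the query dictionary presumably has to be passed through the params keyword; a minimal session-based variant of the call would look something like this:

import requests

baseURL = 'http://www.uniprot.org/uniprot/'
payload = {
    'query': 'EDP09046',
    'format': 'tab',
    'columns': 'id, entry_name, organism, sequence'
}

# Session.get() takes the URL positionally; everything else,
# including the query string, goes in as keyword arguments.
with requests.Session() as session:
    result = session.get(baseURL, params=payload)
    if result.ok:
        print(result.text)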

The code is listed here:

import requests

baseURL = 'http://www.uniprot.org/uniprot/'

sample = open('sample.txt', 'r')
out = open('out', 'w')

for line in sample:
    query = line.strip()
    # One request per ID: ask UniProt for a tab-separated record with
    # the accession, entry name, organism and sequence.
    payload = {
        'query': query,
        'format': 'tab',
        'columns': 'id, entry_name, organism, sequence'
    }
    result = requests.get(baseURL, payload)
    if result.ok:
        # result.text[41:] drops the leading column-header portion of the response.
        out.write(query + '\t' + result.text[41:] + '\n')

Example input format:

EDP09046
ONI31767
ENSFALT00000002630
EAS32469
ENSXETT00000048864

Example output format:

EDP09046 R6X9 A0A251R6X9_PRUPE Prunus persica (Peach) (Amygdalus persica) MEENHAPALESIPNGDHEAATTTNDFNTHIHTNNDHGWQKVTAKRQRKTKPSKADSINNLNKLVPGVTIAGGEGVFRSLEKQSEDRRRRILEAQRAANADADSLAPVRSKLRSDDEDGEDSDDESVAQNVKAEEAKKSKPKKPKKPKVTVAEAAAKIDDANDLSAFLIDISASYESKEDIQLMRFADYFGRAFSAVTAAQFPWVKMFRESTVAKLADIPLSHISEAVYKTSVDWISQRSLEALGSFILWSLDSILADLASQVAGAKGSKKSVQNVSSKSQVAIFVVVAMVLRKKPDVLISILPTLRENSKYQGQDKLPVIVWAISQASQGDLAVGLHSWAHIVLPLVSGKGSNPQSRDLILQLAERILSTPKARTILVNGAVRKGERLVPPSAFEILIGVTFPAPSARVKATERFEAIYPTLKAVALAGSPRSKAMKQVSLQILSFAVKAAGESIPALSNEATGIFIWCLTQHADCFKQWDKVYQENLEASVAVLKKLSDQWKEHSAKLAPFDPMRETLKSFRHKNEKMLASGEDEAHQEKLIKDADKYCKTLLGKSSRGSGCKKSVALAVVALAVGAAVMSPNMESWDWDLEKLRVTISSFFD

Can anyone suggest some ways that I may improve this code to make it faster?

Thanks in advance!

  • Can you clarify, you want to use requests.get 4 million times? In other words, how many lines are in your samples.txt file – jrjames83 Feb 19 '18 at 19:23
  • Well, essentially yes. Unless there is a better way? – Oddish Feb 19 '18 at 19:25
  • Just a quick&dirty solution ... use `xargs` to "parallelize" the work in a cheap way. Create a python file that accepts 1 sys.argv (the query), does "the stuff" and prints the result. Use `cat` to spill out the queries and pipe it "|" to xargs `xargs -n 1 -P 10 python -u ./script.py` (the `-P` is the number of processes to run in parallel). Add some eventual extra "tinkering" with awk if necessary – Lohmar ASHAR Feb 19 '18 at 19:25
  • https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example (try using maybe 15 threads (workers argument)). Also try calling requests with requests.get(url, params=payload) to see if it squashes the args error. – jrjames83 Feb 19 '18 at 19:28
  • 1
    @jrjames83 yes, this is a perfect candidate for threading in Python. See [this related question](https://stackoverflow.com/questions/2632520/what-is-the-fastest-way-to-send-100-000-http-requests-in-python) However, since the OP is hitting some (likely rate-limited API) then really you'll likely just want to use serially batched API calls, as per the answer posted in this question. – juanpa.arrivillaga Feb 19 '18 at 19:41
  • Thanks so much for your input, I'll try everything out and report back – Oddish Feb 19 '18 at 19:45
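
A minimal sketch of the ThreadPoolExecutor approach suggested in the comments above, assuming the same baseURL, input file, and output format as the original script (fetch_record is a made-up helper; the worker count and UniProt's rate limits would still need checking):

import requests
from concurrent.futures import ThreadPoolExecutor

baseURL = 'http://www.uniprot.org/uniprot/'

def fetch_record(query):
    # Hypothetical helper: one request per ID, returning the raw tab-separated text.
    payload = {
        'query': query,
        'format': 'tab',
        'columns': 'id, entry_name, organism, sequence'
    }
    result = requests.get(baseURL, params=payload)
    return query, result.text if result.ok else ''

with open('sample.txt') as sample:
    queries = [line.strip() for line in sample if line.strip()]

# Run up to 15 requests concurrently, as suggested above.
with ThreadPoolExecutor(max_workers=15) as executor, open('out', 'w') as out:
    for query, text in executor.map(fetch_record, queries):
        if text:
            out.write(query + '\t' + text[41:] + '\n')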

1 Answer


Requests are almost always the slowest portion of any networking code, so you'll absolutely want to batch your IDs. UniProt has a batching capability in its API. There's a Perl example on that page that should help you get started – I'd look at what the batch size limit is and go for the largest as a starting point (it's likely much smaller than 4,000,000). As noted on the UniProt site, there's also an ID mapping service that may fit the bill.
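As a rough illustration of what batching might look like against the same endpoint the question uses (this assumes the query field accepts OR-combined IDs and that 200 is an acceptable batch size; both should be verified against UniProt's batch retrieval documentation):

import requests

baseURL = 'http://www.uniprot.org/uniprot/'
CHUNK = 200  # assumed batch size; check UniProt's documented limit

with open('sample.txt') as sample:
    queries = [line.strip() for line in sample if line.strip()]

with open('out', 'w') as out:
    for i in range(0, len(queries), CHUNK):
        payload = {
            # Assumption: multiple IDs can be combined into one query with OR.
            'query': ' OR '.join(queries[i:i + CHUNK]),
            'format': 'tab',
            'columns': 'id, entry_name, organism, sequence'
        }
        result = requests.get(baseURL, params=payload)
        if result.ok:
            out.write(result.text)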

  • Thanks for this, I had a look at this before and was advised against it, as my query IDs come from a wealth of databases (a product of a multi-lab collaboration) and it actually takes an age to convert the IDs to ACC; the code I have above doesn't require the queries to be in ACC – Oddish Feb 19 '18 at 19:45
  • Apologies on the delay, sounds like this may be trickier than I originally expected. I’m unfamiliar with the acronym ACC in this context, can you provide some more detail on it? – Greenstick Feb 19 '18 at 23:53
  • I just realized that ACC probably means accession number, did you ever find a solution that worked for you? – Greenstick Mar 14 '18 at 19:00