
I have a script that takes a file containing a list of query IDs and extracts the organism and sequence for each from UniProt. The code works well, but it is very slow: I want to process approximately 4 million sequences through it, yet it takes around 5 minutes to parse just 100 sequences:

real    5m32.452s
user    0m0.651s
sys 0m0.135s

The code uses Python's requests module. I've read online that I can reuse the connection with requests.Session(), however when I try this I get the following error:

Traceback (most recent call last):
File "retrieve.py", line 14, in <module>
result = session.get(baseURL, payload)
TypeError: get() takes exactly 2 arguments (3 given)
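From what I can tell, Session.get() only accepts the URL as a positional argument, so the query dictionary presumably has to be passed through the params keyword; a minimal session-based variant of the call would look something like this:

import requests

baseURL = 'http://www.uniprot.org/uniprot/'
payload = {
    'query': 'EDP09046',
    'format': 'tab',
    'columns': 'id, entry_name, organism, sequence'
}

# Session.get() takes the URL positionally; everything else,
# including the query string, goes in as keyword arguments.
with requests.Session() as session:
    result = session.get(baseURL, params=payload)
    if result.ok:
        print(result.text)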

The code is listed here:

import requests

baseURL = 'http://www.uniprot.org/uniprot/'

sample = open('sample.txt', 'r')
out = open('out', 'w')

for line in sample:
    query = line.strip()
    # One request per ID: ask UniProt for a tab-separated record with
    # the accession, entry name, organism and sequence.
    payload = {
        'query': query,
        'format': 'tab',
        'columns': 'id, entry_name, organism, sequence'
    }
    result = requests.get(baseURL, payload)
    if result.ok:
        # result.text[41:] drops the leading column-header portion of the response.
        out.write(query + '\t' + result.text[41:] + '\n')

Example input format:

EDP09046
ONI31767
ENSFALT00000002630
EAS32469
ENSXETT00000048864

Example output format:

EDP09046 R6X9 A0A251R6X9_PRUPE Prunus persica (Peach) (Amygdalus persica) MEENHAPALESIPNGDHEAATTTNDFNTHIHTNNDHGWQKVTAKRQRKTKPSKADSINNLNKLVPGVTIAGGEGVFRSLEKQSEDRRRRILEAQRAANADADSLAPVRSKLRSDDEDGEDSDDESVAQNVKAEEAKKSKPKKPKKPKVTVAEAAAKIDDANDLSAFLIDISASYESKEDIQLMRFADYFGRAFSAVTAAQFPWVKMFRESTVAKLADIPLSHISEAVYKTSVDWISQRSLEALGSFILWSLDSILADLASQVAGAKGSKKSVQNVSSKSQVAIFVVVAMVLRKKPDVLISILPTLRENSKYQGQDKLPVIVWAISQASQGDLAVGLHSWAHIVLPLVSGKGSNPQSRDLILQLAERILSTPKARTILVNGAVRKGERLVPPSAFEILIGVTFPAPSARVKATERFEAIYPTLKAVALAGSPRSKAMKQVSLQILSFAVKAAGESIPALSNEATGIFIWCLTQHADCFKQWDKVYQENLEASVAVLKKLSDQWKEHSAKLAPFDPMRETLKSFRHKNEKMLASGEDEAHQEKLIKDADKYCKTLLGKSSRGSGCKKSVALAVVALAVGAAVMSPNMESWDWDLEKLRVTISSFFD

Can anyone suggest some ways that I may improve this code to make it faster?

Thanks in advance!

  • Can you clarify, you want to use requests.get 4 million times? In other words, how many lines are in your samples.txt file – jrjames83 Feb 19 '18 at 19:23
  • Well, essentially yes. Unless there is a better way? – Oddish Feb 19 '18 at 19:25
  • Just a quick&dirty solution ... use `xargs` to "parallelize" the work in a cheap way. Create a python file that accepts 1 sys.argv (the query), does "the stuff" and prints the result. Use `cat` to spill out the queries and pipe it "|" to xargs `xargs -n 1 -P 10 python -u ./script.py` (the `-P` is the number of processes to run in parallel). Add some eventual extra "tinkering" with awk if necessary – Lohmar ASHAR Feb 19 '18 at 19:25
  • https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example (try using maybe 15 threads (workers argument)). Also try calling requests with requests.get(url, params=payload) to see if it squashes the args error. – jrjames83 Feb 19 '18 at 19:28
  • 1
    @jrjames83 yes, this is a perfect candidate for threading in Python. See [this related question](https://stackoverflow.com/questions/2632520/what-is-the-fastest-way-to-send-100-000-http-requests-in-python) However, since the OP is hitting some (likely rate-limited API) then really you'll likely just want to use serially batched API calls, as per the answer posted in this question. – juanpa.arrivillaga Feb 19 '18 at 19:41
  • Thanks so much for your input, I'll try everything out and report back – Oddish Feb 19 '18 at 19:45
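
A minimal sketch of the ThreadPoolExecutor approach suggested in the comments above, assuming the same baseURL, input file, and output format as the original script (fetch_record is a made-up helper; the worker count and UniProt's rate limits would still need checking):

import requests
from concurrent.futures import ThreadPoolExecutor

baseURL = 'http://www.uniprot.org/uniprot/'

def fetch_record(query):
    # Hypothetical helper: one request per ID, returning the raw tab-separated text.
    payload = {
        'query': query,
        'format': 'tab',
        'columns': 'id, entry_name, organism, sequence'
    }
    result = requests.get(baseURL, params=payload)
    return query, result.text if result.ok else ''

with open('sample.txt') as sample:
    queries = [line.strip() for line in sample if line.strip()]

# Run up to 15 requests concurrently, as suggested above.
with ThreadPoolExecutor(max_workers=15) as executor, open('out', 'w') as out:
    for query, text in executor.map(fetch_record, queries):
        if text:
            out.write(query + '\t' + text[41:] + '\n')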

1 Answer


Requests are almost always the slowest portion of any networking code, so you'll absolutely want to batch your IDs. UniProt has a batching capability in its API. There's a Perl example on that page that should help you get started – I'd look at what the batch size limit is and go for the largest as a starting point (it's likely much smaller than 4,000,000). As noted on the UniProt site, there's also an ID mapping service that may fit the bill.
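As a rough illustration of what batching might look like against the same endpoint the question uses (this assumes the query field accepts OR-combined IDs and that 200 is an acceptable batch size; both should be verified against UniProt's batch retrieval documentation):

import requests

baseURL = 'http://www.uniprot.org/uniprot/'
CHUNK = 200  # assumed batch size; check UniProt's documented limit

with open('sample.txt') as sample:
    queries = [line.strip() for line in sample if line.strip()]

with open('out', 'w') as out:
    for i in range(0, len(queries), CHUNK):
        payload = {
            # Assumption: multiple IDs can be combined into one query with OR.
            'query': ' OR '.join(queries[i:i + CHUNK]),
            'format': 'tab',
            'columns': 'id, entry_name, organism, sequence'
        }
        result = requests.get(baseURL, params=payload)
        if result.ok:
            out.write(result.text)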

  • Thanks for this, I had a look at this before and was advised against it, as my query IDs come from a wealth of databases (a product of a multi-lab collaboration) and it actually takes an age to convert the IDs to ACC; the code I have above doesn't require the queries to be in ACC – Oddish Feb 19 '18 at 19:45
  • Apologies on the delay, sounds like this may be trickier than I originally expected. I’m unfamiliar with the acronym ACC in this context, can you provide some more detail on it? – Greenstick Feb 19 '18 at 23:53
  • I just realized that ACC probably means accession number, did you ever find a solution that worked for you? – Greenstick Mar 14 '18 at 19:00