I have a script that takes a file listing query IDs and extracts the organism and sequence for each one from UniProt. The code works well, but it is very slow: I want to process approximately 4 million sequences, and it takes around 5 minutes to get through 100:
real 5m32.452s
user 0m0.651s
sys 0m0.135s
The code uses Python's requests module. I've read online that I can use requests.Session(), however when I try this I get the following error:
Traceback (most recent call last):
  File "retrieve.py", line 14, in <module>
    result = session.get(baseURL, payload)
TypeError: get() takes exactly 2 arguments (3 given)
The code is listed here:
import requests

baseURL = 'http://www.uniprot.org/uniprot/'

# Read one query ID per line and write one tab-separated result line per ID.
with open('sample.txt', 'r') as sample, open('out', 'w') as out:
    for line in sample:
        query = line.strip()
        payload = {
            'query': query,
            'format': 'tab',
            'columns': 'id, entry_name, organism, sequence'
        }
        result = requests.get(baseURL, params=payload)
        if result.ok:
            # Drop the first 41 characters of the response (meant to skip the header row).
            out.write(query + '\t' + result.text[41:] + '\n')
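Regarding the TypeError above: as far as I can tell, `Session.get` accepts only the URL as a positional argument, whereas `requests.get` also accepts the parameters as a second positional argument, so with a session the dictionary would need to be passed as `params=payload`. Below is a minimal sketch of a session-based version under that assumption. The names `build_payload`, `fetch_rows`, and `main` are hypothetical helpers introduced for illustration and are not part of requests.

```python
BASE_URL = 'http://www.uniprot.org/uniprot/'  # same endpoint as in the script above


def build_payload(query):
    """Build the same query parameters the original script sends."""
    return {
        'query': query,
        'format': 'tab',
        'columns': 'id, entry_name, organism, sequence',
    }


def fetch_rows(session, queries, base_url=BASE_URL):
    """Yield (query, response text) pairs, reusing one session for every request.

    `session` is assumed to behave like requests.Session: get(url, params=...)
    returns an object with `.ok` and `.text`.
    """
    for query in queries:
        # Pass the parameters by keyword; passing them positionally is what
        # appears to trigger the TypeError with a session.
        result = session.get(base_url, params=build_payload(query))
        if result.ok:
            yield query, result.text


def main(in_path='sample.txt', out_path='out'):
    import requests  # imported here so the helpers above can be exercised offline

    with requests.Session() as session, \
            open(in_path) as sample, open(out_path, 'w') as out:
        queries = (line.strip() for line in sample if line.strip())
        for query, text in fetch_rows(session, queries):
            # Same fixed-offset slice as the original script.
            out.write(query + '\t' + text[41:] + '\n')
```

Calling `main()` would run this against `sample.txt`. A session keeps the underlying connection open between requests, which may reduce the per-request overhead, although that alone is not guaranteed to make 4 million lookups practical.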
Example input format:
EDP09046
ONI31767
ENSFALT00000002630
EAS32469
ENSXETT00000048864
Example output format:
EDP09046 R6X9 A0A251R6X9_PRUPE Prunus persica (Peach) (Amygdalus persica) MEENHAPALESIPNGDHEAATTTNDFNTHIHTNNDHGWQKVTAKRQRKTKPSKADSINNLNKLVPGVTIAGGEGVFRSLEKQSEDRRRRILEAQRAANADADSLAPVRSKLRSDDEDGEDSDDESVAQNVKAEEAKKSKPKKPKKPKVTVAEAAAKIDDANDLSAFLIDISASYESKEDIQLMRFADYFGRAFSAVTAAQFPWVKMFRESTVAKLADIPLSHISEAVYKTSVDWISQRSLEALGSFILWSLDSILADLASQVAGAKGSKKSVQNVSSKSQVAIFVVVAMVLRKKPDVLISILPTLRENSKYQGQDKLPVIVWAISQASQGDLAVGLHSWAHIVLPLVSGKGSNPQSRDLILQLAERILSTPKARTILVNGAVRKGERLVPPSAFEILIGVTFPAPSARVKATERFEAIYPTLKAVALAGSPRSKAMKQVSLQILSFAVKAAGESIPALSNEATGIFIWCLTQHADCFKQWDKVYQENLEASVAVLKKLSDQWKEHSAKLAPFDPMRETLKSFRHKNEKMLASGEDEAHQEKLIKDADKYCKTLLGKSSRGSGCKKSVALAVVALAVGAAVMSPNMESWDWDLEKLRVTISSFFD
Can anyone suggest some ways that I may improve this code to make it faster?
Thanks in advance!