
I'm stuck in a conundrum of optimization versus the nature of the program. I have code written to extract info from an API and insert it directly into a MongoDB database. The code I've posted operates on only 4 pages of the API, and it works rather quickly. However, the final program needs to work reasonably well on 40 pages, and as of now it seems to stop after 5. To be clear, it says it has completed, but it has only collected from 5. To ensure the right information is placed in the right collection (the collections are named from the extraction itself, not manually), the code is built on a series of nested for loops that are quite slow and pretty hideous to behold. I've been whacking at this for a while and I'm having trouble coming up with any other way to do it that gathers the information accurately and puts it in the right place. Again, I'm looking to reduce the number of nested loops. My API key is blocked out, so this code will not run as posted. The API is NCBO's BioPortal; you can look at it here: http://data.bioontology.org/

Thanks!

import urllib2
import json
import ast

from pymongo import MongoClient
from datetime import datetime



REST_URL = "http://data.bioontology.org"
API_KEY = "********"

client=MongoClient()
db=client.db

print "Accessed database."

def get_json(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('Authorization', 'apikey token=' + API_KEY)]
    return json.loads(opener.open(url).read())

# Get all ontologies from the REST service and parse the JSON                                                                                                                
all_ontologies = get_json(REST_URL+"/ontologies")

selected_ontologies = ['MERA', 'OGROUP', 'GCO', 'OCHV']

onts_acronyms = []
page = None
acronym = None

for ontology in all_ontologies:
    if ontology["acronym"] in selected_ontologies:
        # cleans names and removes whitespace using the ast package
        onts_acronyms.append(ast.literal_eval(json.dumps(ontology["acronym"])))

for acronym in onts_acronyms:
    page = get_json(REST_URL + "/ontologies/" + acronym + "/classes")

    next_page = page
    while next_page:
        next_page = page["links"]["nextPage"]
        for ont_class in page["collection"]:
            result = db[acronym].insert(
                {ont_class["prefLabel"]: {"definition": ont_class["definition"],
                                          "synonyms": ont_class["synonym"]}},
                check_keys=False)
        if next_page:
            page = get_json(next_page)

print "DB Built."
  • Why do you think loops are slow? If you only "think" it, there's another word for that - guessing. Don't guess. ***Find out*** for sure what takes time. Here's a [*simple way to do that*](https://stackoverflow.com/a/4299378/23771). It's probably something you would never guess. – Mike Dunlavey Aug 22 '17 at 14:56
  • I timed my loops. I've also been watching how much of the cpu they've been using. Barely a blip. I'm not guessing. – cobaltchaos Aug 22 '17 at 15:06
  • I tried your method too. It seems the program spends a ton of time on the urllib2 package dependencies and there's nothing I can really do about that. – cobaltchaos Aug 22 '17 at 15:23
  • But which lines in your code are calling urllib2? You call `get_json` a lot, and that calls `urllib2.build_opener()` and then `json.loads(opener.open(url).read())`. If you find these lines on the call stack a lot, it means that's where your time (wall-clock time) is going. It's basically I/O, so of course the CPU time is nil. So no matter how you arrange loops, that's not the problem. What you have to do is try to minimize the I/O. Are any of these URLs getting hit more than once, or hit unnecessarily? – Mike Dunlavey Aug 23 '17 at 14:59
  • There is only one way forward and that is to parallelise it, so you can wait on more urls at the same time. That is if the API allows multiple connections. – Surt Aug 26 '17 at 18:53
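
Following up on the comments about I/O and parallelisation: if I go that route, a minimal sketch (assuming the API tolerates a few concurrent connections, and reusing the `load_ontology` helper sketched above) would be to run the per-ontology loader in a small thread pool, so the process can wait on several URLs at once:

    from multiprocessing.dummy import Pool  # thread pool; threads suffice since the work is I/O-bound

    pool = Pool(4)  # keep the worker count small in case the API limits concurrent connections
    pool.map(load_ontology, onts_acronyms)  # each worker pages through one ontology
    pool.close()
    pool.join()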

0 Answers