
I'm writing a program that downloads data from a website (eve-central.com). It returns XML when I send a GET request with some parameters. The problem is that I need to make about 7080 such requests, because I can't specify the typeid parameter more than once.

def get_data_eve_central(typeids, system, hours, minq=1, thread_count=1):
    import xmltodict, urllib3
    pool = urllib3.HTTPConnectionPool('api.eve-central.com')
    for typeid in typeids:
        r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
        answer = xmltodict.parse(r.data)

It was really slow when I just connected to the website and made all the requests sequentially, so I decided to use multiple threads (I read that when a process involves a lot of waiting (I/O, HTTP requests), it can be sped up a lot with multithreading). I rewrote it using multiple threads, but it somehow isn't any faster (a bit slower, in fact). Here's the code rewritten using multithreading:

import threading, time
import urllib3, xmltodict

def get_data_eve_central(all_typeids, system, hours, minq=1, thread_count=1):

    if thread_count > len(all_typeids): raise NameError('TooManyThreads')

    def requester(typeids):
        pool = urllib3.HTTPConnectionPool('api.eve-central.com')
        for typeid in typeids:
            r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
            answer = xmltodict.parse(r.data)['evec_api']['quicklook']
            answers.append(answer)

    def chunkify(items, quantity):
        chunk_len = len(items) // quantity
        rest_count = len(items) % quantity
        chunks = []
        for i in range(quantity):
            chunk = items[:chunk_len]
            items = items[chunk_len:]
            if rest_count and items:
                chunk.append(items.pop(0))
                rest_count -= 1
            chunks.append(chunk)
        return chunks

    t = time.clock()
    threads = []
    answers = []
    for typeids in chunkify(all_typeids, thread_count):
        threads.append(threading.Thread(target=requester, args=[typeids]))
        threads[-1].start()
        threads[-1].join()

    print(time.clock()-t)
    return answers

What I do is divide all the typeids into as many chunks as the number of threads I want to use, and create a thread for each chunk to process it. The question is: what can slow it down? (I apologise for my bad English.)

Ilya Peterov

1 Answer


Python has a Global Interpreter Lock (GIL), which could be your problem: CPython cannot run threads in a genuinely parallel way. You may think about switching to another language, or staying with Python but using process-based parallelism to solve your task. Here is a nice presentation: Inside the Python GIL.
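
A minimal sketch of the process-based route might look like this (the fetch_one helper, the pool size, and the reuse of the question's endpoint and parameters are assumptions for illustration, not tested against the API):

from multiprocessing import Pool
import urllib3, xmltodict

def fetch_one(args):
    # Each worker process opens its own connection pool and fetches one type ID.
    # fetch_one must live at module level so it can be pickled for the workers.
    typeid, system, hours, minq = args
    pool = urllib3.HTTPConnectionPool('api.eve-central.com')
    r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
    return xmltodict.parse(r.data)['evec_api']['quicklook']

def get_data_eve_central_mp(all_typeids, system, hours, minq=1, process_count=4):
    # Spread the type IDs across worker processes; each result is a parsed quicklook.
    # Call this from under an `if __name__ == '__main__':` guard on platforms that spawn processes.
    with Pool(processes=process_count) as workers:
        return workers.map(fetch_one, [(t, system, hours, minq) for t in all_typeids])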

Vadim
  • I think so too, but I can't see where my threads use the same resource. They use the answers list, yes, but I tried removing it and it was just as slow, so I think that's not the problem. – Ilya Peterov Mar 02 '15 at 16:02
  • Take a look at the nice presentation http://www.dabeaz.com/python/GIL.pdf, "Inside the Python GIL". After that I believe you will find out exactly why the GIL causes you problems. – Vadim Mar 02 '15 at 16:06
  • I read the presentation (it's a great one, by the way, thanks), but I still can't figure out what the problem is in my particular case. The operation that takes most of the time is waiting for the server to respond, but that's an I/O operation and (as I understood from the presentation) it should release the GIL while waiting, which (judging by how long the program runs) it doesn't seem to do. – Ilya Peterov Mar 02 '15 at 16:35
  • I use 'for' to give threads their tasks; I think it's a good idea to try using queues like it was recommended [here](http://stackoverflow.com/questions/6905800/multiprocessing-useless-with-urllib2). – Ilya Peterov Mar 02 '15 at 16:53
  • I rewrote my program using queues and it's really a lot faster. Thanks! – Ilya Peterov Mar 02 '15 at 17:28
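
For reference, a minimal sketch of the queue-based variant described in the last comments might look like this (the worker function, the lock, and the thread count are assumptions, not the asker's final code):

import queue, threading
import urllib3, xmltodict

def get_data_eve_central_queued(all_typeids, system, hours, minq=1, thread_count=8):
    tasks = queue.Queue()
    for typeid in all_typeids:
        tasks.put(typeid)
    answers = []
    answers_lock = threading.Lock()

    def worker():
        # Each thread keeps one connection pool and pulls type IDs until the queue is empty,
        # so no thread sits idle while others still have work left in their chunk.
        pool = urllib3.HTTPConnectionPool('api.eve-central.com')
        while True:
            try:
                typeid = tasks.get_nowait()
            except queue.Empty:
                return
            r = pool.request('GET', '/api/quicklook', fields={'typeid': typeid, 'usesystem': system, 'sethours': hours, 'setminQ': minq})
            answer = xmltodict.parse(r.data)['evec_api']['quicklook']
            with answers_lock:
                answers.append(answer)

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return answers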