
I am currently studying how to fetch data from a website as fast as possible. To get more speed, I'm considering using multiple threads. Here is the code I used to test the difference between multi-threaded and simple POST:

import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
        self.mode = mode

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()

        print "OK"

if __name__ == "__main__":

    current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
                        "Simple")

    #save the time before post data
    origin_time = time.time()

    if(current_post.mode == "Multiple"):

        #multithreading POST

        for i in range(0, 10):
            thread = threading.Thread(target=current_post.post)
            thread.start()
            thread.join()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

    if(current_post.mode == "Simple"):

        #simple POST

        for i in range(0, 10):
            current_post.post()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

Just as you can see, this is very simple code. First I set the mode to "Simple", and I got a time interval of 50 seconds (maybe my connection is a little slow :( ). Then I set the mode to "Multiple", and I got a time interval of 35 seconds. From that I can see that multi-threading can actually increase the speed, but the result isn't as good as I imagined. I want a much faster speed.

From debugging, I found that the program mainly blocks at the line `open_url = urllib2.urlopen(req, self.data)`; this line takes a lot of time to post and receive data from the specified website. I guess maybe I could get a faster speed by adding `time.sleep()` and using multi-threading inside the `urlopen` function, but I cannot do that because it's Python's own function.

Leaving aside the possibility that the server limits the posting speed, what else can I do to get a faster speed? Or is there any other code I could modify? Thanks a lot!

Searene
  • Threading is a bad idea in Python; it gets bottlenecked easily and can get trapped by the GIL. Try multiprocessing. – Jakob Bowyer Apr 14 '12 at 15:02
  • @JakobBowyer: threads are an implementation detail here; the real focus is having multiple connections open. The GIL aspect of threading in Python has no role here whatsoever. – orlp Apr 14 '12 at 16:50
  • @nightcracker, you really should read up on the GIL and threading before making statements like that... start here: [PyCon 2010: Understanding the Python GIL](http://python.mirocommunity.org/video/1479/pycon-2010-understanding-the-p) – Mike Pennington Apr 16 '12 at 18:24

4 Answers


The biggest thing you are doing wrong, and the one hurting your throughput the most, is the way you are calling `thread.start()` and `thread.join()`:

for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    thread.join()

Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next one. You aren't doing anything concurrently at all!

What you should probably be doing instead is:

threads = []

# start all of the threads
for i in range(0, 10):
    thread = threading.Thread(target=current_post.post)
    thread.start()
    threads.append(thread)

# now wait for them all to finish
for thread in threads:
    thread.join()
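
For what it's worth, the same start-all-then-join-all pattern falls out naturally from a thread pool. Here is a minimal sketch (my addition, not part of the original answer) using `multiprocessing.dummy.Pool`, a thread-backed pool in the standard library; `current_post` is the object from the question:

from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but backed by threads

pool = Pool(10)                                      # 10 worker threads
pool.map(lambda _: current_post.post(), range(10))   # blocks until all 10 POSTs finish
pool.close()
pool.join()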
SingleNegationElimination
  • I didn't even look that far down. Join after start again :( – Martin James Apr 14 '12 at 17:28
  • This is an incremental improvement, but no matter what, Python's existing threads are awful. We should be recommending multiprocessing; see my answer. – Mike Pennington Apr 16 '12 at 15:57
  • @Mike: this is not an incremental improvement at all; using the code MarkZar provided, it improved run time in my tests from around 20 seconds to less than half a second. This makes sense, since HTTP uses minimal CPU but is highly sensitive to network latency, and so using `threading` instead of `multiprocessing` is a totally reasonable solution. This goes double if a keep-alive HTTP client were used (`urllib3` was about 30% faster than `urllib2` in my fixed threading tests, no improvement otherwise), which wouldn't be available across processes. – SingleNegationElimination Apr 16 '12 at 17:40
  • @TokenMacGuy, HTTP in Python can use considerable CPU while the query is parsed. That's really beside the point, as David Beazley's presentation makes very clear. There is no good scheduling solution between threads in Python... as you can see, multiprocessing is significantly faster than Python threads. – Mike Pennington Apr 16 '12 at 17:46
  • "Final Thoughts: Don't use this talk to justify not using threads. Threads are a very useful programming tool for many kinds of concurrency problems. Threads can also offer excellent performance even with the GIL (you need to study it)." Cf: [Understanding the GIL by David Beazley](http://www.dabeaz.com/python/UnderstandingGIL.pdf) – gaborous Nov 15 '12 at 15:30
  • @user1121352, that's right... I used *data* to justify multiprocessing vs threads... I did not merely use his presentation – Mike Pennington Nov 16 '12 at 09:22
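
Following up on the keep-alive point raised in the comments above: a minimal sketch (my addition, assuming the `urllib3` package is installed) of the same POST over a reused connection; the URL and form body are the ones from the question:

import urllib3

http = urllib3.PoolManager()   # reuses TCP connections (HTTP keep-alive)
data = "vb_login_username=test&vb_login_password&securitytoken=guest&do=login"
r = http.request("POST", "http://forum.xda-developers.com/login.php",
                 body=data,
                 headers={"Content-Type": "application/x-www-form-urlencoded"})
print r.status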

In many cases, Python's threading doesn't improve execution speed very well... sometimes it makes things worse. For more information, see David Beazley's PyCon 2010 presentation on the Global Interpreter Lock / the PyCon 2010 GIL slides. The presentation is very informative; I highly recommend it to anyone considering threading...

Even though David Beazley's talk explains that network traffic improves the scheduling of Python's threading module, you should use the multiprocessing module. I included this as an option in your code (see the bottom of my answer).

Running this on one of my older machines (Python 2.6.6):

current_post.mode == "Process"  (multiprocessing)  --> 0.2609 seconds
current_post.mode == "Multiple" (threading)        --> 0.3947 seconds
current_post.mode == "Simple"   (serial execution) --> 1.650 seconds

I agree with TokenMacGuy's comment, and the numbers above include moving the `.join()` calls to a separate loop. As you can see, Python's multiprocessing is significantly faster than threading.


from multiprocessing import Process
import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either:
        #   "Simple"      (Simple POST)
        #   "Multiple"    (Multi-thread POST)
        #   "Process"     (Multiprocessing)
        self.mode = mode
        self.run_job()

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()

        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if(self.mode == "Multiple"):

            #multithreading POST
            threads = list()
            for i in range(0, 10):
                thread = threading.Thread(target=self.post)
                thread.start()
                threads.append(thread)
            for thread in threads:
                thread.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Process"):

            #multiprocessing POST
            processes = list()
            for i in range(0, 10):
                process = Process(target=self.post)
                process.start()
                processes.append(process)
            for process in processes:
                process.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Simple"):

            #simple POST
            for i in range(0, 10):
                self.post()
            #calculate the time interval
            time_interval = time.time() - origin_time
            print "mode - {0}: {1}".format(method, time_interval)
        return time_interval

if __name__ == "__main__":

    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php", 
            "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
            method
            )
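
A possible refinement, not part of the answer above: reusing a `multiprocessing.Pool` avoids paying the process start-up cost once per request. Note that the worker must be a module-level function so it can be pickled; `post_once` here is a hypothetical helper:

from multiprocessing import Pool
import urllib2

def post_once(args):
    # module-level so it can be pickled and shipped to worker processes
    website, data = args
    return urllib2.urlopen(urllib2.Request(website), data).read()

if __name__ == "__main__":
    job = ("http://forum.xda-developers.com/login.php",
           "vb_login_username=test&vb_login_password&securitytoken=guest&do=login")
    pool = Pool(10)
    pool.map(post_once, [job] * 10)   # runs the 10 POSTs across 10 worker processes
    pool.close()
    pool.join()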
Mike Pennington
  • Thanks a lot. Multiprocessing is a good idea; it's indeed a little faster than multi-threading on my computer. Thanks, all of you; I learned a lot from this question. – Searene Apr 18 '12 at 13:03
  • @MarkZar, I would say a 33% improvement in speed is more than a little faster, but regardless, I wish you well on your project. – Mike Pennington Apr 18 '12 at 14:18
  • On some code of mine that just processes .ods files with the pyexcel_ods library, using 200 threads/processes (or 1 in simple mode), similar behavior gives: Simple = 16s, Multiple = 28s (???), Process = 6s. Thank you, man. – Manuel Fedele Sep 22 '18 at 16:30

Keep in mind that the only case where multi-threading can "increase speed" in Python is when you have operations like this one that are heavily I/O bound. Otherwise multi-threading does not increase "speed", since it cannot run on more than one CPU (no, not even if you have multiple cores; Python doesn't work that way). You should use multi-threading when you want two things to be done at the same time, not when you want two things to be parallel (i.e. two processes running separately).
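
To see the I/O-bound point concretely, here is a small demo (mine, not part of the original answer) where `time.sleep` stands in for a blocking network call; ten one-second "waits" finish in about one second on threads, not ten, because a sleeping or blocked thread releases the GIL:

import threading
import time

def io_task():
    time.sleep(1)   # stands in for a blocking network call

start = time.time()
threads = [threading.Thread(target=io_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print "10 simulated I/O waits took %.2fs" % (time.time() - start)   # ~1s, not ~10s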

Now, what you're actually doing will not increase the speed of any single DNS lookup, but it will allow multiple requests to be fired off while you wait for the results of others. However, you should be careful how many you make, or you will just make the response times even worse than they already are.
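
One common way to cap the number of in-flight requests is a semaphore; a minimal sketch (my addition; the limit of 5 is an arbitrary illustration, not a recommendation):

import threading
import urllib2

gate = threading.BoundedSemaphore(5)   # at most 5 requests in flight at once

def bounded_post(url, data):
    with gate:   # blocks while 5 other requests are still running
        return urllib2.urlopen(urllib2.Request(url), data).read()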

Also, please stop using urllib2 and use Requests: http://docs.python-requests.org
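
For reference, the same POST with Requests might look like this (a sketch assuming the `requests` package is installed; the field values are the ones from the question):

import requests

payload = {"vb_login_username": "test",
           "vb_login_password": "",
           "securitytoken": "guest",
           "do": "login"}
r = requests.post("http://forum.xda-developers.com/login.php", data=payload)
print r.status_code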

Wes

A DNS lookup takes time. There's nothing you can do about it. Large latencies are one reason to use multiple threads in the first place - multiple lookups and site GET/POSTs can then happen in parallel.

Dump the sleep() - it's not helping.

Martin James
  • Thanks, but I'm just confused about why `time.sleep()` is useless. Indeed, it also works well after dumping `sleep()`, but how can it achieve multi-threading without `sleep()`? Does Python switch between threads automatically? If so, what's the use of the `sleep()` function? – Searene Apr 15 '12 at 07:59
  • It is not useless, merely inappropriate here. There are loads of uses for sleep: 'After turning on the pump, wait at least ten seconds for the pressure to stabilize before opening the feed valve'. – Martin James Apr 15 '12 at 23:08