
The class BrokenLinkTest in the code below does the following:

  1. takes a web page URL
  2. finds all the links in the page
  3. gets the headers of those links concurrently (to check whether each link is broken)
  4. prints 'completed' when all the headers have been received

from bs4 import BeautifulSoup
import requests
import threading

class BrokenLinkTest(object):

    def __init__(self, url):
        self.url = url
        self.thread_count = 0
        self.lock = threading.Lock()

    def execute(self):
        soup = BeautifulSoup(requests.get(self.url).text)
        self.lock.acquire()
        for link in soup.find_all('a'):
            url = link.get('href')
            threading.Thread(target=self._check_url(url))
        self.lock.acquire()

    def _on_complete(self):
        self.thread_count -= 1
        if self.thread_count == 0: #check if all the threads are completed
            self.lock.release()
            print "completed"

    def _check_url(self, url):
        self.thread_count += 1
        print url
        result = requests.head(url)
        print result
        self._on_complete()


BrokenLinkTest("http://www.example.com").execute()

Can the concurrency/synchronization part be done in a better way? I did it using threading.Lock. This is my first experiment with Python threading.

Arun Ghosh
  • Look at pool.map in https://docs.python.org/2/library/multiprocessing.html . It will make your code so much easier (see the sketch after these comments). – Claude Oct 07 '14 at 10:53
  • It's not obvious what you want, what you have and how you expect to get there with what you've done. Please **give example input and output** needed and *explain* what you've been trying to do to achieve this. – Veedrac Oct 07 '14 at 10:59
  • `print` is not thread safe. This will mess up the output. All those threads will randomly make calls to `print` –  Oct 07 '14 at 12:39
  • Look at the code examples that show how to do multiple concurrent connections and limit (synchronize) them with/without multiple threads: [Limiting number of processes in multiprocessing python](http://stackoverflow.com/q/23236190/4279), [Problem with multi threaded Python app and socket connections](http://stackoverflow.com/q/4783735/4279), [Brute force basic http authorization using httplib and multiprocessing](https://gist.github.com/zed/0a8860f4f9a824561b51), [Is there a way to run cpython on a diffident thread without risking a crash?](http://stackoverflow.com/q/12228783/4279). – jfs Oct 07 '14 at 13:09
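For reference, a minimal sketch of the pool.map approach suggested in the first comment, using the thread-backed Pool from multiprocessing.dummy; the worker count of 10 and the function names here are illustrative, not from the question:

from multiprocessing.dummy import Pool  # same API as multiprocessing, but backed by threads

import requests
from bs4 import BeautifulSoup

def check_url(url):
    # a HEAD request is enough to see whether the link answers
    return url, requests.head(url).status_code

def broken_link_test(page_url):
    soup = BeautifulSoup(requests.get(page_url).text)
    urls = [a.get('href') for a in soup.find_all('a') if a.get('href')]
    pool = Pool(10)  # at most 10 concurrent HEAD requests
    for url, status in pool.map(check_url, urls):
        print url, status
    pool.close()
    pool.join()
    print "completed"  # pool.map has already waited for every worker

pool.map blocks until every worker has returned, so no explicit Lock or counter is needed to know when to print 'completed'.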

2 Answers


All threads in Python run on the same core, so you won't gain any performance this way. It's also unclear what is actually happening:

  1. You are never actually starting the threads, you are just initializing them
  2. The threads themselves do absolutely nothing other than decrementing the thread count

You may only gain performance in a thread-based scenario if your program is waiting on I/O (sending requests, writing to files and so on), so that other threads can do work in the meantime.
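A quick way to see the I/O-bound case for yourself (an illustrative sketch, not from the answer; the URLs are placeholders): time the same batch of requests.head calls serially and with threads. Because each worker spends its time blocked on the network, the threaded batch should take roughly as long as the slowest single request.

import time
import threading
import requests

urls = ["http://www.example.com"] * 5  # placeholder URLs; substitute the links you scraped

start = time.time()
for u in urls:
    requests.head(u)  # serial: each request waits for the previous one to finish
print "serial:", time.time() - start

start = time.time()
threads = [threading.Thread(target=requests.head, args=(u,)) for u in urls]
for t in threads:
    t.start()  # all requests are in flight at once
for t in threads:
    t.join()
print "threaded:", time.time() - start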

Martol1ni
  • Python threads are real OS threads; they can run on multiple CPUs. Pure Python code in CPython is protected by the global interpreter lock (GIL), so only one Python thread is active at a time, but the GIL can be released during I/O (and other blocking system calls); ctypes releases the GIL by default, and many C extension modules such as numpy, lxml and regex can release the GIL during computations. The relevant part is that `requests.get()` is probably I/O bound, and BeautifulSoup may use `lxml` if it is installed. – jfs Oct 07 '14 at 12:46
You could use the join method to wait for all the threads to finish:
def execute(self):
    soup = BeautifulSoup(requests.get(self.url).text)
    threads = []
    for link in soup.find_all('a'):
        url = link.get('href')
        # pass the bound method and its argument; do not call it here
        t = threading.Thread(target=self._check_url, args=(url,))
        t.start()  # actually start the worker thread
        threads.append(t)
    for thread in threads:
        thread.join()  # block until every worker has finished


Note I also added a `start` call, and passed the bound method object to the `target` param. In your original example you were calling `_check_url` in the main thread and passing its return value to `target`.
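With this approach the Lock and thread_count bookkeeping from the question become unnecessary: join() already blocks until every worker has finished, so 'completed' can simply be printed in execute right after the join loop. A minimal sketch of the slimmed-down worker (hypothetical, not part of the original answer):

def _check_url(self, url):
    # no shared counter or _on_complete callback needed;
    # execute's join loop already knows when all workers are done
    result = requests.head(url)
    print url, result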

GP89