
I am trying to understand why running multiple parsers in parallel threads does not speed up parsing HTML. One thread does 100 tasks twice as fast as two threads with 50 tasks each.

Here is my code:

from lxml.html import fromstring
import time
from threading import Thread
try:
    from urllib import urlopen
except ImportError:
    from urllib.request import urlopen

DATA = urlopen('http://lxml.de/FAQ.html').read()


def func(number):
    for x in range(number):
        fromstring(DATA)


print('Testing one thread (100 job per thread)')
start = time.time()
t1 = Thread(target=func, args=[100])
t1.start()
t1.join()
elapsed = time.time() - start
print('Time: %.5f' % elapsed)

print('Testing two threads (50 jobs per thread)')
start = time.time()
t1 = Thread(target=func, args=[50])
t2 = Thread(target=func, args=[50])
t1.start()
t2.start()
t1.join()
t2.join()
elapsed = time.time() - start
print('Time: %.5f' % elapsed)

Output on my 4-core machine:

Testing one thread (100 job per thread)
Time: 0.55351
Testing two threads (50 jobs per thread)
Time: 0.88461

According to the FAQ (http://lxml.de/FAQ.html#can-i-use-threads-to-concurrently-access-the-lxml-api) two threads should work faster than one thread.

Since version 1.1, lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory, as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself.

...

The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.

So the question is: why are two threads slower than one?

My environment: Debian Linux, lxml 3.3.5-1+b1; same results on Python 2 and Python 3.

BTW, a friend ran this test on macOS and got the same timings for one and for two threads. Either way, that is not what the documentation says should happen (two threads should be twice as fast).

UPD: Thanks to spectras, who pointed out that a parser has to be created in each thread. The updated code of the func function is:

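from io import BytesIO
from lxml.html import HTMLParser
from lxml.etree import parse

def func(number):
    parser = HTMLParser()  # one parser per thread
    for x in range(number):
        # DATA is bytes (from urlopen().read()), so wrap it in BytesIO
        parse(BytesIO(DATA), parser=parser)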

The output is:

Testing one thread (100 jobs per thread)
Time: 0.53993
Testing two threads (50 jobs per thread)
Time: 0.28869

That is exactly what I wanted! :)

2 Answers


The documentation gives a good lead there: "as long as you use either the default parser (which is replicated for each thread) or create a parser for each thread yourself."

You're definitely not creating a parser for each thread. You can see that if you do not specify a parser yourself, the fromstring function uses a global one.

As for the other condition, you can see at the bottom of the file that html_parser is an instance of a subclass of lxml.etree.HTMLParser, with no special behavior and, most importantly, no thread-local storage. I cannot test it here, but I believe you end up sharing one parser across your two threads, which does not qualify as the "default parser".

Could you try instantiating the parsers yourself and passing them to fromstring? Otherwise I'll do it in an hour or so and update this post.

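from lxml.html import HTMLParser, fromstring

def func(number):
    # One parser per thread, created inside the thread's own function
    parser = HTMLParser()
    for x in range(number):
        fromstring(DATA, parser=parser)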
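
If you would rather not create the parser by hand in every thread function, thread-local storage is one way to get the same effect. The sketch below is a minimal, stdlib-only illustration: get_thread_parser and make_parser are hypothetical names, and make_parser returns a placeholder object so the snippet runs without lxml. The assumption is that with lxml installed you would pass lxml.html.HTMLParser as the factory instead.

```python
import threading

_local = threading.local()  # each thread sees its own attributes on this object


def make_parser():
    # Placeholder factory (hypothetical) so the sketch runs without lxml.
    # With lxml, pass lxml.html.HTMLParser to get_thread_parser instead.
    return object()


def get_thread_parser(factory=make_parser):
    # Create the parser lazily, once per thread, then reuse it.
    parser = getattr(_local, "parser", None)
    if parser is None:
        parser = factory()
        _local.parser = parser
    return parser


def worker(results, index):
    # Two calls in the same thread should return the same object.
    first = get_thread_parser()
    second = get_thread_parser()
    results[index] = (first, first is second)


results = [None, None]
threads = [threading.Thread(target=worker, args=(results, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

same_within_thread = all(reused for _, reused in results)
distinct_across_threads = results[0][0] is not results[1][0]
print(same_within_thread, distinct_across_threads)
```

Each thread builds its own parser on first use and keeps reusing it, so no parser object is ever shared between threads.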
spectras

That's because of how threads work in Python, and there are differences between Python 2.7 and Python 3. If you really want to speed up the parsing, you should use multiprocessing rather than multithreading. Read this: How do threads work in Python, and what are common Python-threading specific pitfalls?

And this is about multiprocessing: http://sebastianraschka.com/Articles/2014_multiprocessing_intro.html

Unless the work is I/O-bound, using threads only adds context-switching overhead, because only one thread can run at a time. See also: When are Python threads fast?
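
The effect described here is easy to observe with work that never releases the GIL. The following is a rough sketch using only the standard library; count_down and timed_threads are illustrative names, and the timings depend on the machine and interpreter, so none are shown. It assumes a standard CPython build with the GIL and says nothing about C extensions that release it.

```python
import threading
import time


def count_down(n):
    # Pure-Python CPU-bound loop; it holds the GIL while it runs.
    while n > 0:
        n -= 1


def timed_threads(n_threads, total_work):
    # Split total_work evenly across n_threads and return wall-clock seconds.
    per_thread = total_work // n_threads
    threads = [threading.Thread(target=count_down, args=(per_thread,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


one = timed_threads(1, 2000000)
two = timed_threads(2, 2000000)
print('one thread:  %.5f s' % one)
print('two threads: %.5f s' % two)
```

With a loop like this, the two-thread run is not expected to be meaningfully faster than the single-thread run, because only one thread executes Python bytecode at a time.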

Good luck.

wa11a
    lxml's documentation explicitly mentions that it releases the GIL when parsing data. That means the usual caveats of python threads should not apply, as long as the conditions that trigger the releasing of the GIL by libxml are met. – spectras Aug 29 '15 at 11:42