Concurrently parse in-memory xml tree with lxml

Question

Say I have a program that looks like this:

from lxml import etree

class ParseXmlFile(object):
    def __init__(self, xml_to_parse):
        self.xml = etree.parse(xml_to_parse)

    def a(self):
        return self.xml.xpath('//something')

    def b(self):
        return self.xml.xpath('//something-else')

lxml frees the GIL, so it should be possible to run a and b concurrently in separate threads or processes.

From the lxml docs:

lxml frees the GIL (Python's global interpreter lock) internally when parsing from disk and memory...The global interpreter lock (GIL) in Python serializes access to the interpreter, so if the majority of your processing is done in Python code (walking trees, modifying elements, etc.), your gain will be close to zero. The more of your XML processing moves into lxml, however, the higher your gain. If your application is bound by XML parsing and serialisation, or by very selective XPath expressions and complex XSLTs, your speedup on multi-processor machines can be substantial.
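If the quoted passage holds for the XPath calls in question, threads are the lower-friction option: they share memory, so nothing has to be pickled and bound methods can be handed to a pool directly. A minimal sketch of that idea, assuming a hypothetical stand-in class `FakeParser` (used in place of `ParseXmlFile` so the snippet runs without lxml) and hypothetical helpers `call` and `run_all`:

```python
from multiprocessing.pool import ThreadPool


class FakeParser(object):
    """Hypothetical stand-in for ParseXmlFile; real methods would call self.xml.xpath()."""

    def __init__(self, items):
        self.items = items

    def a(self):
        return [i for i in self.items if i.startswith('something')]

    def b(self):
        return [i for i in self.items if i.endswith('else')]


def call(func):
    # Each worker thread simply invokes the zero-argument callable it receives.
    return func()


def run_all(funcs, workers=2):
    # Threads share memory, so bound methods are passed as-is (no pickling involved).
    pool = ThreadPool(workers)
    try:
        return pool.map(call, funcs)  # results come back in the order of funcs
    finally:
        pool.close()
        pool.join()


parser = FakeParser(['something', 'something-else', 'other'])
results = run_all([parser.a, parser.b])
```

With the real class, the call would presumably be `run_all([parser.a, parser.b])` on a `ParseXmlFile` instance. Whether this yields a speedup depends on how much of the time is actually spent inside lxml rather than in Python code.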

I have done little to no work with multithreading.

A typical multiprocessing implementation would use something like multiprocessing.Pool().map(), which seems to be of no use here, since I have a list of functions and a single argument rather than a single function and a list of arguments. Wrapping each function in another function and passing that to the pool, as described in one of the answers, raises the following exception:

cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
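This error is consistent with how pickle treats functions: they are serialized by reference (module name plus function name), so a function created inside another function or method cannot be looked up again in the worker process. A small sketch of the difference, using hypothetical names `top_level`, `make_nested` and `can_pickle`:

```python
import pickle


def top_level(x):
    return x


def make_nested():
    def nested(x):  # only reachable through make_nested's local scope
        return x
    return nested


def can_pickle(obj):
    # Report whether pickle is able to serialize obj.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        # Python 2 raises PicklingError here; Python 3 may raise AttributeError instead.
        return False


top_ok = can_pickle(top_level)
nested_ok = can_pickle(make_nested())
```

When run as an ordinary script or module, `top_ok` should be true and `nested_ok` false, which suggests keeping any wrapper passed to `Pool.map()` at module level.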

Is it possible to do what I'm describing? If so, how?

http://stackoverflow.com/questions/25991860/unable-to-pass-an-lxml-etree-object-to-a-separate-process — Padraic Cunningham, Aug 05 '16 at 22:00
@PadraicCunningham I don't understand how this helps. In and of itself what you linked to doesn't answer my question. In any event the error I experience with the answer below is due to functions not being pickle-able. Registering etree with the pickler would not solve that. — AutomaticStatic, Aug 07 '16 at 00:11

score 1 · Answer 1 · answered Aug 05 '16 at 21:28

Functions are data, so you can do something like this:

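from multiprocessing import Pool

def f1(xml):
    print("applying f1 to xml")
    return "f1(%s)" % xml

def f2(xml):
    print("applying f2 to xml")
    return "f2(%s)" % xml

if __name__ == '__main__':
    xml = "the xml"

    # applyf is still a module-level name (an if block adds no scope),
    # so it can be pickled by reference.
    def applyf(f):
        # Return the result so that map() collects it instead of None.
        return f(xml)

    p = Pool(5)
    print(p.map(applyf, [f1, f2]))
    p.close()
    p.join()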
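The pattern above relies on `applyf` and `xml` being visible inside the worker processes, which holds when workers are forked but not when they are started fresh. A variant that avoids that assumption sends the argument along with each function and keeps the helper at module level. This is only a sketch: `count_chars`, `upper`, `apply_pair` and `run_in_processes` are hypothetical names, and the XML is passed as a plain string because, as the question linked in the comments suggests, lxml tree objects may not survive the trip to another process.

```python
from multiprocessing import Pool


def count_chars(xml_text):
    # Placeholder for real work, e.g. parsing xml_text and running an XPath query.
    return len(xml_text)


def upper(xml_text):
    return xml_text.upper()


def apply_pair(pair):
    # Module-level helper: unpack (function, argument) and call the function.
    func, arg = pair
    return func(arg)


def run_in_processes(funcs, arg, workers=2):
    # Turn "many functions, one argument" into the "one function, many
    # arguments" shape that Pool.map() expects.
    pool = Pool(workers)
    try:
        return pool.map(apply_pair, [(f, arg) for f in funcs])
    finally:
        pool.close()
        pool.join()
```

Calling `run_in_processes([count_chars, upper], '<root/>')` from under an `if __name__ == '__main__':` guard should return `[7, '<ROOT/>']`. Each worker would have to parse the XML itself, so this only pays off when the per-function work outweighs the repeated parsing.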

ErikR

`cPickle.PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed` – AutomaticStatic Aug 05 '16 at 21:48
The code above runs without any errors under python 2.7. How are you getting that error? – ErikR Aug 05 '16 at 22:15
