I had the same issue in a multiprocessing context. It can be illustrated by the following snippet:
    from multiprocessing import Pool

    import lxml.html

    def process(html):
        tree = lxml.html.fromstring(html)
        body = tree.find('.//body')
        print(body)
        return body

    def main():
        pool = Pool()
        result = pool.apply(process, ('<html><body/></html>',))
        print(type(result))
        print(result)

    if __name__ == '__main__':
        main()
Running it produces the following output:
    <Element body at 0x7f9f690461d8>
    <class 'lxml.html.HtmlElement'>
    Traceback (most recent call last):
      File "test.py", line 18, in <module>
        main()
      File "test.py", line 14, in main
        print(result)
      File "src/lxml/lxml.etree.pyx", line 1142, in lxml.etree._Element.__repr__ (src/lxml/lxml.etree.c:54748)
      File "src/lxml/lxml.etree.pyx", line 992, in lxml.etree._Element.tag.__get__ (src/lxml/lxml.etree.c:53182)
      File "src/lxml/apihelpers.pxi", line 19, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:16856)
    AssertionError: invalid Element proxy at 139697870845496
Thus the most obvious explanation, given that __repr__ works in the worker process while the return value is only broken in the calling process, is a deserialization issue: the element does not survive the pickle round trip that multiprocessing performs. It can be solved, for example, by returning lxml.html.tostring(body), or any other pickle-able object.