I have a soup from BeautifulSoup that I cannot pickle. When I try to pickle the object, the Python interpreter silently crashes (in a way that cannot be handled as an exception). I need to be able to pickle the object in order to return it through the `multiprocessing` package (which pickles objects to pass them between processes). How can I troubleshoot or work around the problem? Unfortunately, I cannot post the HTML for the page (it is not publicly available), and I have been unable to find a reproducible example of the problem. I have tried to isolate the problem by looping over the soup and pickling individual components; the smallest thing that produces the error is a `<class 'BeautifulSoup.NavigableString'>`. When I print the object it prints out `u'\n'`.
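The isolation loop described above can be sketched so that a hard interpreter crash cannot take down the parent: probe each pickle attempt in a child process and inspect its exit code. This is a hypothetical helper (`survives_pickling` is not a real API), and it assumes a Unix `fork` start method so the object reaches the child without being pickled first; a lambda stands in for an unpicklable component.

```python
import multiprocessing as mp
import pickle

# The fork start method hands obj to the child without pickling it first
# (assumes a Unix platform, as in the question).
mp.set_start_method("fork", force=True)

def _try_pickle(obj):
    # Runs in a child process; even a hard interpreter crash only kills the child.
    pickle.dumps(obj)

def survives_pickling(obj):
    """Return True if pickling obj completes in a child process."""
    proc = mp.Process(target=_try_pickle, args=(obj,))
    proc.start()
    proc.join()
    return proc.exitcode == 0

print(survives_pickling(u"\n"))        # plain unicode: True
print(survives_pickling(lambda s: s))  # unpicklable stand-in: False
```

Looping this probe over `soup.recursiveChildGenerator()` (or each child in turn) would flag the offending component without crashing the main process.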

-
Unfortunately, aside from casting NavigableString to a unicode or str, there's nothing you can do here (well, patch beautifulsoup as well) – dekomote Jul 03 '14 at 21:07
-
@dekomote Is this a known issue with `BeautifulSoup`? – Michael Jul 03 '14 at 21:08
-
Yup. `NavigableString` is not pickle-able. It should implement `__unicode__`, but it fails somehow. – dekomote Jul 03 '14 at 21:09
-
How do I require all objects created by BeautifulSoup to be turned into unicode prior to pickling and returned to their original type after unpickling, keeping in mind I am doing this within the `multiprocessing` package? – Michael Jul 03 '14 at 21:25
3 Answers
The class `NavigableString` is not serializable with `pickle` or `cPickle`, which `multiprocessing` uses. You should be able to serialize this class with `dill`, however. `dill` has a superset of the `pickle` interface and can serialize most of Python. `multiprocessing` will still fail unless you use a fork of `multiprocessing` that uses `dill`, called `pathos.multiprocessing`.
Get the code here: https://github.com/uqfoundation.
For more information see: What can multiprocessing and dill do together?
http://matthewrocklin.com/blog/work/2013/12/05/Parallelism-and-Serialization/
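As a minimal illustration of the coverage difference (assuming `dill` is installed via `pip install dill`; a module-level lambda stands in here for any object plain `pickle` rejects, since this sketch avoids a BeautifulSoup dependency):

```python
import pickle
import dill  # assumed installed: pip install dill

obj = lambda s: s.strip()  # stand-in for an object pickle cannot handle

try:
    pickle.dumps(obj)          # plain pickle rejects lambdas
except Exception as e:
    print("pickle failed:", type(e).__name__)

restored = dill.loads(dill.dumps(obj))  # dill round-trips it
print(restored(u"  hello  "))  # 'hello'
```

`pathos.multiprocessing` applies the same substitution inside the pool machinery, which is why objects that break the stock `multiprocessing` can pass between its workers.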

-
Thanks for the info. It would be great if this were installable via pip. When I try to install it I get the error message: "Could not find a version that satisfies the requirement pathos (from versions: 0.1a1)" – Michael Jul 03 '14 at 21:59
-
I'm aware that's an issue with the old released versions. However, you can install the older versions with `pip` if you use the prerelease flag. Then you can grab the code from github and it installs pretty easily. The next version will be pip-installable. – Mike McKerns Jul 04 '14 at 02:49
-
I installed pathos and tried to use its pool function in place of the base multiprocessing package. However, I am encountering the same issue as I describe at the link below. The object that causes it to hang on return *can* be pickled using dill, but it may be too big for multiprocessing queues. Any suggestions for making large objects work with pathos? http://stackoverflow.com/questions/24537379/python-multiprocessing-script-freezes-seemingly-without-error – Michael Jul 07 '14 at 19:38
-
Detailed description in new question: https://stackoverflow.com/questions/24619642/multiprocessing-large-objects-using-pathos-in-python – Michael Jul 07 '14 at 20:51
-
You may be able to get away with compression or with shared memory. Shared memory with `ctypes` through `multiprocessing` might work, if you have access to how the `map` is called. Otherwise, `dill` has some compression options that are currently "turned off". If your large data could go into a `numpy` array (…?), then there might be a route that way too. Hard to tell without seeing what your data looks like. Also, use the latest `pathos` (from github), and `ProcessingPool` as opposed to `Pool`. – Mike McKerns Jul 08 '14 at 00:17
If you do not need the beautiful soup object itself, but some product of the soup (e.g. a text string), you can remove BeautifulSoup attributes from your larger object before pickling by adding the following code to your class definition:

```python
class MyObject(MyObject):
    def __getstate__(self):
        # Drop any BeautifulSoup-derived attributes before pickling.
        for item in list(self.__dict__):
            item_type = str(type(getattr(self, item)))
            if 'BeautifulSoup' in item_type:
                delattr(self, item)
        return self.__dict__
```
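A self-contained sketch of the same `__getstate__` idea, using a hypothetical `Page` class in which a lambda stands in for the unpicklable soup attribute (only stdlib `pickle` is needed to try it):

```python
import pickle

class Page(object):
    def __init__(self, html):
        self.text = html.upper()         # the product we actually need
        self.soup = lambda s: s.strip()  # unpicklable stand-in for a soup

    def __getstate__(self):
        # Copy the instance dict and drop the unpicklable attribute,
        # leaving the live object untouched.
        state = self.__dict__.copy()
        state.pop("soup", None)
        return state

page = Page("<p>hello</p>")
restored = pickle.loads(pickle.dumps(page))
print(restored.text)              # '<P>HELLO</P>'
print(hasattr(restored, "soup"))  # False
```

Copying `self.__dict__` before pruning, rather than calling `delattr`, means pickling does not mutate the original object, which matters if the parent process still needs the soup afterward.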

-
sure… that makes sense. Essentially, use `__reduce__` to pick out what state you want to save. – Mike McKerns Jul 10 '14 at 03:49
In fact, as suggested by dekomote, you only have to take advantage of the fact that you can always convert a soup to a unicode string, and then convert that string back into a soup.
So IMHO you should not try to pass soup objects through the multiprocessing package, but simply the strings representing the soups.
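A sketch of the round trip (written against the newer bs4 / BeautifulSoup 4 API rather than the BeautifulSoup 3 import used in the question, and assuming bs4 is installed):

```python
from bs4 import BeautifulSoup  # assumes bs4 (BeautifulSoup 4) is installed

soup = BeautifulSoup("<p>hello <b>world</b></p>", "html.parser")

html = str(soup)                              # soup -> plain string: safe to pickle
rebuilt = BeautifulSoup(html, "html.parser")  # string -> soup on the other side

print(rebuilt.b.string)  # 'world'
```

The worker would return `str(soup)` (or `unicode(soup)` on Python 2) and the receiving process would re-parse it, so only ordinary strings ever cross the `multiprocessing` boundary.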

-
"you should not try to pass soup object through the multiprocessing package, but simply the strings representing the soups." This should occur automatically though because multiprocessing passes objects by pickling them and the soup should pass its string representation to pickle. – Michael Jul 03 '14 at 21:15
-
I just did some tests and pickle.dump is about 20% faster than prettify while pickle.load is about 35% faster than Beautifulsoup(html) (n=1 with 300kb file, so just to get an idea). Meanwhile pickle doesn't work for large soups due to (hard) recursion limit, so I think this small performance hit is worth it in most cases. – Mark Feb 05 '16 at 16:51