I have a strange error, and I will try to simplify my problem. I have a simple function that scrapes a URL with Beautiful Soup and returns a list. I then pickle the list to a file, calling sys.setrecursionlimit(10000) first to avoid a RecursionError. Up to that point, everything works.

But when I try to unpickle my list, I get this error:

Traceback (most recent call last):
  File ".\scrap_index.py", line 86, in <module>
    data_file = pickle.load(data)
TypeError: __new__() missing 1 required positional argument: 'name'

Here is my function:

import urllib.request
from bs4 import BeautifulSoup

def scrap_function(url):
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html5lib")   

    return [soup]

For testing, I've tried different URLs. With this one, everything works:

url_ok = 'https://www.boursorama.com/bourse/'

But with this one, I get the TypeError:

url_not_ok = 'https://www.boursorama.com/bourse/actions'

And the test code:

import pickle
import sys

sys.setrecursionlimit(10000)

scrap_list = scrap_function(url_not_ok)

with open('test_saving.pkl', 'wb') as data:
    pickle.dump(scrap_list, data, protocol=2)

with open('test_saving.pkl', 'rb') as data:
    data_file = pickle.load(data)

print(data_file)
MrCed

1 Answer

This states

If a class takes extra arguments in its __new__ constructor, pickle fails to serialize it.

This could be what causes the problem here in BeautifulSoup:

class NavigableString(unicode, PageElement):
    def __new__(cls, value):

This answer states the same.
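
The mechanism can be reproduced without BeautifulSoup. The following is a minimal sketch, assuming a made-up `TwoArgString` class (not part of bs4) whose `__new__` requires two arguments. With protocol 2, pickle stores only the plain string value and passes that single value back to `__new__` on load, so dumping works but loading raises the same kind of TypeError as in the question.

```python
import pickle


class TwoArgString(str):
    """Hypothetical str subclass whose __new__ needs two arguments."""

    def __new__(cls, prefix, name):
        return super().__new__(cls, prefix + ":" + name)


item = TwoArgString("xlink", "href")

# Dumping succeeds: pickle records the class and the plain string value.
payload = pickle.dumps([item], protocol=2)

# Loading fails: pickle calls TwoArgString.__new__(cls, "xlink:href"),
# which supplies only one of the two required arguments.
try:
    pickle.loads(payload)
    load_error = None
except TypeError as exc:
    load_error = str(exc)

print(load_error)

# Converting to a plain str before pickling sidesteps the problem.
restored = pickle.loads(pickle.dumps([str(item)], protocol=2))
print(restored)
```

This would also be consistent with the error appearing only for some pages: it depends on whether the parsed tree happens to contain an object of such a class.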

As a solution, do not store the whole object. Store only the source code of the page instead, as mentioned here.
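
A minimal sketch of that approach, assuming an inline HTML string in place of the downloaded page and the built-in "html.parser" instead of html5lib so the example needs no network access or extra parser: the markup is pickled as a plain `str`, and the soup is rebuilt after loading.

```python
import pickle

from bs4 import BeautifulSoup

# Stand-in for the page fetched with urllib.request.urlopen(url).
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Pickle the page source as a plain string, not the soup object itself.
payload = pickle.dumps([str(soup)], protocol=2)

# After unpickling, parse the string again to get a usable soup back.
pages = pickle.loads(payload)
restored = BeautifulSoup(pages[0], "html.parser")
print(restored.title.string)
```

Because only plain strings are pickled, raising the recursion limit should no longer be necessary for the dump, at the cost of parsing the page again after loading.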

Joe
  • Thanks, I will try. But if that is the cause, why does it work with some URLs? – MrCed Aug 17 '18 at 14:22
  • I've found the solution in your links! It was an encoding problem. I just have to `return [str(soup)]` so that no `NavigableString` ends up in the pickle. Thank you! – MrCed Aug 17 '18 at 16:14