__new__() missing 1 required positional argument depending on URL scraped

Question

I have a strange error, and I will try to simplify my problem. I have a simple function that scrapes a URL with Beautiful Soup and returns a list. I then pickle the list to a file, calling sys.setrecursionlimit(10000) first to avoid a RecursionError. Up to that point, everything works.

But when I try to unpickle my list, I get this error:

Traceback (most recent call last):
  File ".\scrap_index.py", line 86, in <module>
    data_file = pickle.load(data)
TypeError: __new__() missing 1 required positional argument: 'name'

Here is my function:

import urllib.request
from bs4 import BeautifulSoup

def scrap_function(url):
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, "html5lib")   

    return [soup]

For testing, I tried different URLs. With this URL, everything works:

url_ok = 'https://www.boursorama.com/bourse/'

But with this one, I get the TypeError:

url_not_ok = 'https://www.boursorama.com/bourse/actions'

And the test code:

import pickle
import sys

sys.setrecursionlimit(10000)

scrap_list = scrap_function(url_not_ok)

with open('test_saving.pkl', 'wb') as data:
    pickle.dump(scrap_list, data, protocol=2)

with open('test_saving.pkl', 'rb') as data:
    data_file = pickle.load(data)

print(data_file)

score 3 · Accepted Answer · answered Aug 17 '18 at 13:21

This states

If some class objects have extra arguments in the __new__ constructor, pickle fails to serialize it.

This could be causing the problem here in BeautifulSoup:

class NavigableString(unicode, PageElement):
    def __new__(cls, value):

This answer states the same.
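
The failure mode quoted above can be reproduced without BeautifulSoup. The following is only a sketch: `TwoArgString` is a hypothetical class, not part of bs4, and it assumes nothing more than a `str` subclass whose `__new__` requires an extra argument called `name`. Pickling succeeds, but unpickling calls `__new__` with just the string value (which is what the inherited `str.__getnewargs__` supplies), so the extra argument is missing.

```python
import pickle


class TwoArgString(str):
    """Hypothetical str subclass whose __new__ needs two arguments."""

    def __new__(cls, prefix, name):
        # Build the final string from both arguments.
        return super().__new__(cls, prefix + ":" + name)


value = TwoArgString("xlink", "href")

# Dumping works: pickle records the class and the plain string value.
payload = pickle.dumps(value, protocol=2)

error_message = None
try:
    # Loading calls TwoArgString.__new__(TwoArgString, "xlink:href"),
    # i.e. with one argument fewer than __new__ requires.
    pickle.loads(payload)
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Whether a given page triggers this depends on whether the parsed tree happens to contain an object of such a class, which would be consistent with the error appearing for some URLs only.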

As a solution, do not store the whole object; store only the source code of the page, as mentioned here.
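
A minimal sketch of that approach, under a few assumptions: the helper names `save_pages`, `load_pages` and `load_soups` are hypothetical, and the pages are passed in as `BeautifulSoup` objects or strings. Each page is converted to a plain `str` before pickling, and the soup is rebuilt from the HTML text after loading.

```python
import pickle


def save_pages(pages, path):
    """Pickle the HTML source of each page as a plain str."""
    # str() on a soup (or on any str subclass) yields a built-in str,
    # so no bs4 classes end up inside the pickle file.
    html_pages = [str(page) for page in pages]
    with open(path, "wb") as fh:
        pickle.dump(html_pages, fh, protocol=2)


def load_pages(path):
    """Return the stored list of HTML strings."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def load_soups(path, parser="html5lib"):
    """Re-parse every stored page; requires bs4 and the chosen parser."""
    # Imported here so that the two helpers above work without bs4.
    from bs4 import BeautifulSoup

    return [BeautifulSoup(html, parser) for html in load_pages(path)]
```

With plain strings in the file, raising the recursion limit should no longer be necessary for pickling, since there is no deeply nested object tree to traverse.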

answered Aug 17 '18 at 13:21

Joe


Thanks, I will try. But if that's the case, why does it work with some URLs? – MrCed Aug 17 '18 at 14:22
I've found the solution in your links! It was an encoding problem. I just have to `return [str(soup)]` so that there is no `NavigableString`. Thank you! – MrCed Aug 17 '18 at 16:14
