How to save the BeautifulSoup object to a file and then read from it as BeautifulSoup?

Question

I want to save a BeautifulSoup object to a file. My plan is to convert it to a string and write that string to a file, then read the string back and convert it into a BeautifulSoup object again. This would help with testing, because the data I am scraping is dynamic.

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://coinmarketcap.com/all/views/all/"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")

Writing the soup object like this:

  new_soup = str(soup)
  with open("coin.txt", "w+") as f:
      f.write(new_soup)

produces this error:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 28127-28132: character maps to <undefined>

Also, once I am able to save it to a file, how would I read the string back in as a BeautifulSoup object?

Possible duplicate of [How can I use pickle to save a dict?](https://stackoverflow.com/questions/11218477/how-can-i-use-pickle-to-save-a-dict) — Oliver Baumann, Oct 24 '18 at 16:21
You are doing something terrible. You'd better save the HTML instead. — KC., Oct 25 '18 at 06:26

Edgar Ramírez Mondragón · Answer 1 · 2018-10-26T17:18:31.017

EDIT

The old code couldn't pickle the soup object because of a RecursionError:

Traceback (most recent call last):
  File "soup.py", line 20, in <module>
    pickle.dump(soup, f)
RecursionError: maximum recursion depth exceeded while calling a Python object

The solution is to increase the recursion limit. They do the same in this answer, which in turn references the docs.

HOWEVER, the particular site you're trying to load and save is extremely nested. My computer can't get past a recursion limit of 50000, which is still not enough for your site, and Python crashes: 10008 segmentation fault (core dumped) python soup.py.

So, if you need to download the HTML and use it later, you can do this:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://coinmarketcap.com/all/views/all/"
html = urlopen(url)

# Save HTML to a file
with open("soup.html", "wb") as f:
    while True:
        chunk = html.read(1024)
        if not chunk:
            break
        f.write(chunk)

Then you can read the HTML file you saved and instantiate the bs4 object with it:

# Read HTML from a file
with open("soup.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "lxml")

print(soup.title)
# <title>All Cryptocurrencies | CoinMarketCap</title>
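The UnicodeEncodeError in the question most likely comes from opening the file in text mode without an explicit encoding, so Python falls back to the platform's default codec (a 'charmap' codec such as cp1252 on many Windows setups). The binary approach above avoids that entirely. If you prefer to write `str(soup)` as text, here is a minimal sketch that passes `encoding="utf-8"` on both sides. The helper names `save_markup` and `load_markup` and the sample string are hypothetical, chosen only for illustration:

```python
import tempfile
from pathlib import Path


def save_markup(markup: str, path: Path) -> None:
    """Write markup as UTF-8 instead of the platform's default codec."""
    path.write_text(markup, encoding="utf-8")


def load_markup(path: Path) -> str:
    """Read the markup back using the same explicit encoding."""
    return path.read_text(encoding="utf-8")


# Hypothetical stand-in for str(soup); it contains non-ASCII characters
# that a legacy default codec such as cp1252 cannot encode.
sample = "<html><head><title>Prices in ₿ and Ξ</title></head><body></body></html>"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "coin.html"
    save_markup(sample, target)
    restored = load_markup(target)

# The text survives the round trip unchanged.
print(restored == sample)
```

With a real page you would call `save_markup(str(soup), path)` and later rebuild the object with `BeautifulSoup(load_markup(path), "lxml")`.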

Additionally, this is the code I would use for a less nested site:

import pickle
from bs4 import BeautifulSoup
from urllib.request import urlopen
import sys

url = "https://stackoverflow.com/questions/52973700/how-to-save-the-beautifulsoup-object-to-a-file-and-then-read-from-it-as-beautifu"
html = urlopen(url)
soup = BeautifulSoup(html,"lxml")

# Raise the recursion limit so pickle can walk the nested soup tree
sys.setrecursionlimit(8000)

# Save the soup object to a file
with open("soup.pickle", "wb") as f:
    pickle.dump(soup, f)

# Read the soup object from a file
with open("soup.pickle", "rb") as f:
    soup_obj = pickle.load(f)

print(soup_obj.title)
# <title>python - How to save the BeautifulSoup object to a file and then read from it as BeautifulSoup? - Stack Overflow</title>
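Calling `sys.setrecursionlimit` changes the limit for the rest of the program. If you only want the higher limit while pickling, one option is a small context manager that restores the previous value afterwards. This is only a sketch: `recursion_limit` is a hypothetical helper, and a plain dict stands in for the soup object so the example runs without any third-party package:

```python
import pickle
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit: int):
    """Temporarily raise the recursion limit, then restore the old value."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured.
    sys.setrecursionlimit(max(limit, previous))
    try:
        yield
    finally:
        sys.setrecursionlimit(previous)


# Hypothetical stand-in for a soup object: any picklable value works here.
document = {"title": "example", "children": [{"tag": "p", "text": "hello"}]}

before = sys.getrecursionlimit()
with recursion_limit(8000):
    inside = sys.getrecursionlimit()
    payload = pickle.dumps(document)
after = sys.getrecursionlimit()

restored = pickle.loads(payload)
print(inside >= 8000, after == before, restored == document)
```

With a real soup, the `pickle.dump(soup, f)` call would go inside the `with recursion_limit(8000):` block. As noted above, a very deeply nested page can still crash the interpreter even with a raised limit, so saving the HTML remains the safer route.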

I get `RecursionError: maximum recursion depth exceeded while pickling an object` when I run this code, because the object is huge. — rockikz, Oct 24 '18 at 18:03
You're both right. I edited my answer. Turns out the resulting soup object is too nested to pickle it, but there are workarounds. — Edgar Ramírez Mondragón, Oct 26 '18 at 17:19
