
I am trying to scrape a book from a website, and while parsing it with Beautiful Soup I noticed some errors in the extracted text. For example, this sentence:

"You have more… direct control over your skaa here. How many woul "Oh, a half dozen or so,"

The "more…" and the truncated " woul" are both errors that were introduced somewhere in the script.

Is there any way to automatically clean up mistakes like this? Example code of what I have is below.


import requests
from bs4 import BeautifulSoup
url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'html.parser')
print(soup.prettify())




trin = soup.tr.get_text()
final = str(trin)
print(final)

QHarr
  • Does this answer your question? [Using BeautifulSoup to get_text of td tags within a resultset](https://stackoverflow.com/questions/50287133/using-beautifulsoup-to-get-text-of-td-tags-within-a-resultset) – Mike Slinn Apr 02 '21 at 00:20
  • I couldn't find any other way to fix this, so I just made another script that uses mostly pandas, and it worked fine. Thanks for the info! I am leaving the question up in case someone else can help me figure out Beautiful Soup some more, as I would love to use it. – Silver Wolf2r Apr 02 '21 at 05:53

1 Answer


You need to unescape the HTML entities, that is, convert them back to characters, as detailed here. To apply that in your situation while retaining the text, you can iterate over stripped_strings and unescape each string:

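import html

import requests
from bs4 import BeautifulSoup

url = 'http://thefreeonlinenovel.com/con/mistborn-the-final-empire_page-1'
res = requests.get(url)
text = res.text
soup = BeautifulSoup(text, 'lxml')  # the 'lxml' parser requires the lxml package

# Walk the text fragments of the first table row, with surrounding whitespace stripped
for r in soup.select_one('table tr').stripped_strings:
    s = html.unescape(r)  # convert entities such as &amp; back to characters
    print(s)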
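
If unescaping alone does not remove the "…" sequence, the cause may be an encoding mismatch rather than an HTML entity. The sketch below uses only the standard library and rests on an assumption that is not confirmed for this site: that the page is UTF-8 and was decoded as Windows-1252, which is a common source of exactly this pattern. The helper name `fix_mojibake` and the sample strings are illustrative only.

```python
import html


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Hypothetical helper for illustration. The input is returned
    unchanged when it cannot be round-tripped this way.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# html.unescape converts named and numeric entities into characters.
print(html.unescape("more&hellip; direct &amp; simple"))

# It leaves mis-decoded bytes alone, because they are not entities.
garbled = "You have more… direct control"
print(html.unescape(garbled))

# Re-encoding with the assumed wrong codec and decoding as UTF-8
# recovers the intended ellipsis character.
print(fix_mojibake(garbled))
```

With requests, the same idea would be to set `res.encoding = res.apparent_encoding` before reading `res.text`, or to pass the raw bytes in `res.content` to BeautifulSoup so that it detects the encoding itself. Neither approach explains the truncated " woul", which may come from the page or the parser rather than from the encoding.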
QHarr