Using HTML entities in BeautifulSoup find

Question

I'm doing some simple crawling in Python (using BeautifulSoup4) and I'm having trouble retrieving tags that contain HTML entities.

This is a small example (I only removed the real URLs):

start_url = "..."
next_chapter_bad = "Next Chapter ]&gt;"
next_chapter_good = "Next Chapter ]>"

"""
<td class="comic_navi_right">
    <a href="..." class="navi navi-next-chap" title="Next Chapter ]&gt;">Next Chapter ]&gt;</a>
    <a href="..." class="navi comic-nav-next navi-next" title="Next Page &gt;">Next Page &gt;</a>
    <a href="..." class="navi navi-last" title="Most Recent Page &gt;&gt;">Most Recent Page &gt;&gt;</a>
</td>
"""
page = requests.get(start_url)
if page.status_code != requests.codes.ok:
    return ''

soup = BeautifulSoup(page.text)
# get the url for the "Next chapter" link
next_link = soup.find('a', href=True, string=next_chapter_bad)
print( next_link)
next_link = soup.find('a', href=True, string=next_chapter_good)
print( next_link)

The output is:

None
<a class="navi navi-next-chap" href="..." title="Next Chapter ]&gt;">Next Chapter ]&gt;</a>

Is there a way to make find() work with HTML entities?

Dušan Maďar · Accepted Answer · 2017-11-02T19:48:47.807

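You have to unescape the HTML entities (https://stackoverflow.com/a/2087433/4183498) before searching, because &gt; is the escaped form of >. BeautifulSoup decodes entities while parsing, so the tag's text contains a literal > and never matches the escaped search string.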
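As a quick check of what unescaping does to the two search strings from the question, here is a minimal sketch. It assumes Python 3.4 or newer, where the standard library provides html.unescape; nothing here touches BeautifulSoup yet.

```python
import html

# The two search strings from the question.
next_chapter_bad = "Next Chapter ]&gt;"
next_chapter_good = "Next Chapter ]>"

# Unescaping turns the entity into the literal character.
unescaped = html.unescape(next_chapter_bad)
print(unescaped == next_chapter_good)

# A string without entities passes through unchanged, so it is safe
# to unescape every search string unconditionally.
already_plain = html.unescape(next_chapter_good)
print(already_plain == next_chapter_good)

# Several entities in one string are all decoded.
most_recent = html.unescape("Most Recent Page &gt;&gt;")
print(most_recent)
```

Both comparisons should come out true, which is why the unescaped string can be handed to find() in place of the escaped one.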

import html  # Python 3.4+; on Python 2 use HTMLParser().unescape from the HTMLParser module

...

soup = BeautifulSoup(page.text, 'html.parser')
# get the url for the "Next chapter" link;
# unescape the search string so "&gt;" becomes ">" and matches the parsed text
next_link = soup.find('a', href=True, string=html.unescape(next_chapter_bad))
print(next_link)
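
To try this without fetching a page, the following sketch runs the same lookup against an inline copy of the markup from the question. The href values and the helper name find_link_by_text are made up for illustration, and it assumes Python 3 with the bs4 package installed.

```python
import html

from bs4 import BeautifulSoup

# Markup adapted from the question; the href values are hypothetical placeholders.
SAMPLE_HTML = """
<td class="comic_navi_right">
    <a href="/chapter-2" class="navi navi-next-chap" title="Next Chapter ]&gt;">Next Chapter ]&gt;</a>
    <a href="/page-2" class="navi comic-nav-next navi-next" title="Next Page &gt;">Next Page &gt;</a>
</td>
"""


def find_link_by_text(soup, text):
    """Return the first <a href=...> whose text equals `text` after unescaping.

    Hypothetical helper: it accepts the search text with or without
    HTML entities, because unescaping a plain string leaves it unchanged.
    """
    return soup.find('a', href=True, string=html.unescape(text))


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Escaped and unescaped search strings now find the same link.
escaped_hit = find_link_by_text(soup, "Next Chapter ]&gt;")
plain_hit = find_link_by_text(soup, "Next Chapter ]>")

# Searching with the raw escaped string still fails, as in the question.
raw_miss = soup.find('a', href=True, string="Next Chapter ]&gt;")

print(escaped_hit['href'] if escaped_hit else None)
print(raw_miss)
```

Note that printing the tag itself shows &gt; again, because BeautifulSoup re-escapes special characters on output; the in-memory string still holds the literal >.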

edited Nov 02 '17 at 19:48

answered Nov 02 '17 at 19:38
