According to this older answer, Python 3 strings are UTF-8 compliant by default. But in my web scraper using BeautifulSoup, when I try to print or display a URL, the Japanese characters show up as '%E3%81%82' or '%E3%81%B3' instead of the actual characters.

This Japanese website is the one I'm collecting information from, specifically the URLs behind the clickable letter buttons. When you hover over, for example, あa, your browser shows that the link you're about to click is https://kokugo.jitenon.jp/cat/gojuon.php?word=あ. However, when I extract the ["href"] attribute of the link using BeautifulSoup, I get https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82.

Both versions link to the same web page, but for the sake of debugging, I'm wondering if it's possible to make sure the displayed string contains the actual Japanese character. If not, how can I convert the string to accommodate this purpose?

JansthcirlU
  • As a temporary workaround, I made a dictionary where the keys are the messed up character representations (with the percent signs) and the values are the corresponding Japanese characters. – JansthcirlU Jan 06 '21 at 20:41

1 Answer

It's called Percent-encoding:

Percent-encoding, also known as URL encoding, is a method to encode arbitrary data in a Uniform Resource Identifier (URI) using only the limited US-ASCII characters legal within a URI.
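To make the definition concrete, here is a minimal sketch of how あ ends up as %E3%81%82: the character is encoded to its UTF-8 bytes, and each byte is written as a percent sign followed by two hex digits. The variable names are illustrative only; quote is the standard-library counterpart of unquote.

```python
from urllib.parse import quote

char = "あ"

# UTF-8 turns this single character into three bytes.
utf8_bytes = char.encode("utf-8")
print(utf8_bytes)

# Percent-encoding writes every byte as '%' plus two hex digits.
manual = "".join(f"%{b:02X}" for b in utf8_bytes)
print(manual)

# urllib.parse.quote performs the same transformation.
encoded = quote(char)
print(encoded)
```

Both print calls at the end show %E3%81%82, the same sequence that appears in the scraped href.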

Apply the unquote function from the urllib.parse module:

urllib.parse.unquote(string, encoding='utf-8', errors='replace')

Replace %xx escapes by their single-character equivalent. The optional encoding and errors parameters specify how to decode percent-encoded sequences into Unicode characters, as accepted by the bytes.decode() method.

string may be either a str or a bytes object. Changed in version 3.9: string parameter supports bytes and str objects (previously only str).

encoding defaults to 'utf-8'. errors defaults to 'replace', meaning invalid sequences are replaced by a placeholder character.
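The two optional parameters can be illustrated with a short sketch. The inputs below are made up for demonstration: %82%A0 is あ in Shift_JIS (an encoding some older Japanese sites use), and %FF is a byte that is never valid in UTF-8.

```python
from urllib.parse import unquote

# A non-default encoding: the bytes 0x82 0xA0 are 'あ' in Shift_JIS.
shift_jis_a = unquote("%82%A0", encoding="shift_jis")
print(shift_jis_a)

# Default errors='replace': the invalid byte becomes U+FFFD.
replaced = unquote("%FF")
print(repr(replaced))

# errors='strict' raises instead of silently substituting.
try:
    unquote("%FF", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
print(strict_failed)
```

For the URLs in the question the defaults are sufficient, since the escapes are UTF-8.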

Example:

from urllib.parse import unquote

encodedUrl = 'JapaneseChars%E3%81%82or%E3%81%B3'
decodedUrl = unquote(encodedUrl)
print(decodedUrl)  # JapaneseCharsあorび

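One can apply unquote to almost any string, even one that is already decoded. The exception is a string that still contains a literal percent sign followed by two hex digits, which would be decoded again:

print(unquote(decodedUrl))  # JapaneseCharsあorび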
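Applied to the URL from the question, a minimal sketch might look like the following. The href string is copied from the question rather than scraped, so no BeautifulSoup call is needed here; parse_qs is shown as an alternative that decodes query values on its own. The last lines show the one case where decoding twice changes the result.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# The value BeautifulSoup returned for the link, as quoted in the question.
href = "https://kokugo.jitenon.jp/cat/gojuon.php?word=%E3%81%82"

# Decode the whole URL for display or debugging.
readable = unquote(href)
print(readable)

# parse_qs decodes percent-escapes in query values automatically.
params = parse_qs(urlsplit(href).query)
print(params)

# Caveat: an escaped percent sign ('%25') makes a second pass decode further.
once = unquote("%2541")
twice = unquote(once)
print(once, twice)
```

The decoded form is convenient for logging; the percent-encoded form is still the safer one to pass on to an HTTP client.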
JosefZ
  • Awesome, thanks! As a follow-up, does the percent-encoding occur when the data gets scraped from the website, or when Python tries to display it? – JansthcirlU Jan 07 '21 at 17:41