
I've been tinkering with Python using Pythonista on my iPad. I decided to write a simple script that pulls Japanese song lyrics from one website and makes POST requests to another website that annotates the lyrics with extra information.

When I use Python 2 and the module mechanize for the second website, everything works fine, but when I use Python 3 and requests, the resulting text is nonsense.

This is a minimal script that doesn't exhibit the issue:

#!/usr/bin/env python2

from bs4 import BeautifulSoup

import requests
import mechanize

def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib") 
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()

    # Use second website to annotate lyrics with furigana
    browser = mechanize.Browser()
    browser.open('http://furigana.sourceforge.net/cgi-bin/index.cgi')
    browser.select_form(nr=0)
    browser.form['text'] = raw_lyrics
    request = browser.submit()

    # My actual script does more stuff at this point, but this snippet doesn't need it

    annotated_lyrics = BeautifulSoup(request.read().decode('utf-8'), "html5lib").find("body").get_text()
    print annotated_lyrics

if __name__ == '__main__':
    main()

The truncated output is:

扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)の夜(よる)昨日(きのう)どうやって帰(かえ)った体(からだ)だけが確(たし)かおはよう これからまた迷子(まいご)の続(つづ)き見慣(みな)れた知(し)らない景色(けしき)の中(なか)でもう駄目(だめ)って思(おも)ってから わりと何(なん)だかやれている死(し)にきらないくらいに丈夫(じょうぶ)何(なに)かちょっと恥(は)ずかしいやるべきことは忘(わす)れていても解(わか)るそうしないと とても苦(くる)しいから顔(かお)を上(あ)げて黒(くろ)い目(め)の人(にん)君(くん)が見(み)たから光(ひかり)は生(う)まれた選(えら)んだ色(しょく)で塗(ぬ)った世界(せかい)に [...]

This is a minimal script that exhibits the issue:

#!/usr/bin/env python3
from bs4 import BeautifulSoup

import requests

def main():
    # Get lyrics from first website (lyrical-nonsense.com)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib") 
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()

    # Use second website to annotate lyrics with furigana
    url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
    data = {'text': raw_lyrics, 'state': 'output'}
    html_annotated_lyrics = BeautifulSoup(requests.post(url, data=data).text, "html5lib")
    annotated_lyrics = html_annotated_lyrics.find("body").get_text()

    print(annotated_lyrics)

if __name__ == '__main__':
    main()

whose truncated output is:

IQp{_<n(åiFcf0c_S`QLºKJoFSK~_÷PnMc_åjDorn-gFÄîcfcfKhU`KfD{kMjDOD+UKacheZKWDyMSho،fDfã]FWjDhhfæWDKTRfÒDînºL_KIo~_x`rgWc_Lkò~fxyjD·nsoiS`FTê`QLÒüíüLn [...]

It's worth noting that if I just try to get the HTML of the second request, like so:

# Use second website to anotate lyrics with fugigana
url = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
data = {'text': raw_lyrics, 'state': 'output'}
annotated_lyrics = requests.post(url, data=data).content.decode('utf-8')

An embedded null character error occurs when printing annotated_lyrics. This issue can be circumvented by passing truncated lyrics to the POST request. In the current example, only one character can be passed.

However, with

url = 'https://www.lyrical-nonsense.com/lyrics/aimer/brave-shine/'

I can pass up to 51 characters, like so:

data = {'text': raw_lyrics[0:51], 'state': 'output'}

before triggering the embedded null character error.
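
For reference, the raw response can be inspected like this (a diagnostic sketch, using the same url and data as above):

# Diagnostic sketch: look at the raw bytes of the POST response
# (url and data defined as in the snippets above).
r = requests.post(url, data=data)

print(r.status_code)             # HTTP status of the response
print(r.content[:60])            # first few raw bytes of the payload
print(r.content.count(b'\x00'))  # number of embedded null bytes, if any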

I've tried using urllib instead of requests, and decoding and encoding to UTF-8 both the resulting HTML of the POST request and the data passed as an argument to it. I've also checked that the encoding of the website is UTF-8, which matches the encoding of the POST request:

r = requests.post(url, data=data)   
print(r.encoding)

prints utf-8.
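
Along the same lines, the server-declared encoding and requests' own guess can be compared (just a diagnostic sketch; the actual header values depend on the server):

r = requests.post(url, data=data)
print(r.headers.get('Content-Type'))  # charset declared by the server, if any
print(r.encoding)                     # encoding requests uses to build r.text
print(r.apparent_encoding)            # encoding guessed from the raw body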

I think the problem has to do with how Python 3 is stricter about the distinction between strings and bytes, but I've been unable to pinpoint the exact cause.
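
To be clear about what I mean by strings vs bytes, requests exposes both forms of the response (just an illustration, not part of my actual script):

r = requests.post(url, data=data)
print(type(r.text))     # <class 'str'>   decoded text, built using r.encoding
print(type(r.content))  # <class 'bytes'> raw, undecoded payload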

While I'd appreciate a working code sample in Python 3, I'm more interested in what exactly I'm doing wrong, i.e. in what the code is doing that results in the failure.

eli
  • Welcome to StackOverflow! I am wondering if you tried confirming that with `Python3` what you get in `annotated_lyrics` is not a string? Maybe `.decode('utf-8')` would have helped... Have you tried that? – sophros Aug 31 '18 at 05:48
  • @sophros Thank you! And yes, I have tried decoding the result of the post request made to the second url. In the minimal sample code I've provided, running `annotated_lyrics.decode('utf-8')` makes the interpreter complain about how `'str' object has no attribute 'decode'`, which means `annotated_lyrics` *is* a string. I have also tried calling `.decode('utf-8')` and/or `.encode('utf-8')` in many places, to no avail. I also added these extra `.encode('utf-8')` calls when I tried using `urllib` instead of `requests`. – eli Aug 31 '18 at 06:02
  • To narrow things down, is the output of `request.read().decode('utf-8')` (python2 script) the same as `requests.post(url, data=data).text` (python3)? – snakecharmerb Sep 01 '18 at 08:22
  • I am not an expert in those libraries you use but I have come across the following question: https://stackoverflow.com/questions/13837848/converting-byte-string-in-unicode-string. In the examples there is still a `.decode('utf-8')` step applied to the returned bytes. This should help. – sophros Aug 31 '18 at 05:51
  • I think this is exactly what I do in the third code snippet, with this line `annotated_lyrics = requests.post(url, data=data).content.decode('utf-8')`, and that doesn't work. The `.content` returns a `bytes` object, and calling `decode('utf-8')` on that object results in the same string as when I just directly call `.text` on the response, just like in the second code snippet. – eli Aug 31 '18 at 06:05
  • @snakecharmerb yes, the output is the same. I've accepted an answer that explains that the issue had nothing to do with anything not being encoded in UTF-8, but with how data was passed to the post request: it needs to be passed via a `files` parameter because the form is a multipart form. `requests` defaults to urlencoding (when using the `data` parameter), so that's why the data was mangled horribly. – eli Sep 03 '18 at 02:57

1 Answer


I'm able to get the lyrics properly with this code in python3.x:

import requests
from bs4 import BeautifulSoup

url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
resp = requests.get(url)
print(BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text())

Printing (truncated)

>>> BeautifulSoup(resp.text).find('div', class_='olyrictext').get_text()
'扉開けば\u3000捻れた昼の夜\r\n昨日どうやって帰った\u3000体だけ...'

A few things strike me as odd there, notably the \r\n (Windows line ending) and \u3000 (IDEOGRAPHIC SPACE), but that's probably not the problem.

The one thing I noticed that's odd about the form submission (and why the browser emulator probably succeeds) is that the form uses multipart rather than urlencoded form data (signified by enctype="multipart/form-data").
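
A quick way to see the difference is to compare what requests prepares for data= versus files= (a sketch using PreparedRequest; the exact boundary string will vary):

import requests

url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'

# data= produces an application/x-www-form-urlencoded body
urlencoded = requests.Request('POST', url2, data={'text': 'abc', 'state': 'output'}).prepare()
# files= with a None filename produces a multipart/form-data body of plain fields
multipart = requests.Request('POST', url2, files={'text': (None, 'abc'), 'state': (None, 'output')}).prepare()

print(urlencoded.headers['Content-Type'])  # application/x-www-form-urlencoded
print(multipart.headers['Content-Type'])   # multipart/form-data; boundary=...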

Sending multipart form data is a little bit strange in requests; I had to poke around a bit and eventually found this, which helps show how to format the multipart data in a way that the backing server understands. To do this you have to abuse files but pass None as the filename. "for humans" hah!

url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
resp2 = requests.post(url2, files={'text': (None, raw_lyrics), 'state': (None, 'output')})

And the text is not mangled now!

>>> BeautifulSoup(resp2.text).find('body').get_text()
'\n扉(とびら)開(ひら)けば捻(ねじ)れた昼(ひる)...'

(Note that this code should work in either python2 or python3)
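
For completeness, the whole thing in Python 3 might look like this, combining the scraping from the question with the files= submission (a sketch; it assumes the div with id="Lyrics" from the question is still present on the lyrics page):

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import requests

def main():
    # Get lyrics from the first website (page structure as in the question)
    url = 'https://www.lyrical-nonsense.com/lyrics/bump-of-chicken/hello-world/'
    html_raw_lyrics = BeautifulSoup(requests.get(url).text, "html5lib")
    raw_lyrics = html_raw_lyrics.find("div", id="Lyrics").get_text()

    # Submit as multipart/form-data: files= with a None filename sends plain fields
    url2 = 'http://furigana.sourceforge.net/cgi-bin/index.cgi'
    files = {'text': (None, raw_lyrics), 'state': (None, 'output')}
    html_annotated_lyrics = BeautifulSoup(requests.post(url2, files=files).text, "html5lib")
    print(html_annotated_lyrics.find("body").get_text())

if __name__ == '__main__':
    main()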

anthony sottile
  • Yes, this solved the issue. I think at some point I *did* stumble upon some of these resources, but then I probably dismissed them because the argument is called `files` (and I'm not doing anything with `files`, or so I thought), and because I was convinced it was some kind of encoding issue. Thank you so much! I've already fixed the code with your suggestion, but I'm definitely going to read some more about multipart forms. In particular, I wonder how other software deals with them, namely Workflow for iOS. – eli Sep 03 '18 at 02:51
  • If you found this answer useful, be sure to upvote -- welcome to Stack overflow :) – anthony sottile Sep 03 '18 at 03:46
  • Thanks! I did upvote it, but my reputation is too low so it doesn't show publicly. I did accept the answer though. – eli Sep 04 '18 at 04:05