The Google Translate Python API has a format_ keyword which may be set to "html": https://googlecloudplatform.github.io/google-cloud-python/latest/translate/client.html

I have some HTML for a news article which was retrieved using the newspaper3k package: https://github.com/codelucas/newspaper/

The HTML is a binary string that starts like this:

b'<!DOCTYPE html>\r\n<html xmlns="http://www.w3.org/1999/xhtml" lang="ar" dir="rtl" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">\r\n<head>\r\n\t<!-- Meta, title, CSS, favicons, etc. -->\r\n\t<meta charset="UTF-8" />\r\n\t<meta http-equiv="Conten

I try to translate this HTML (which is largely in Arabic) into English using this Google Translate Python API call:

html_english=translate_client.translate(html_arabic, target_language='en', format_='html')

This results in the following error (object of type bytes is not JSON serializable). What am I doing wrong?

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)

~\AppData\Local\conda\conda\envs\xview\lib\json\encoder.py in default(self, o)
    178         """
    179         raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180                         o.__class__.__name__)
    181 
    182     def encode(self, o):

TypeError: Object of type 'bytes' is not JSON serializable
Lars Ericson
  • What do you mean by "only documented by example"? When I click on the links on the left side, I get [what certainly looks like reference API docs](https://googlecloudplatform.github.io/google-cloud-python/latest/translate/client.html). – abarnert Sep 09 '18 at 23:38
  • Also, "The input text can be plain text or HTML." So what happens if you just send the article's HTML as your input text? – abarnert Sep 09 '18 at 23:45
  • Thanks, I didn't see that API page, I could only find the example. I will try that now and update in a minute. – Lars Ericson Sep 09 '18 at 23:48

1 Answer

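The fix (thanks to @abarnert and the question "Python 3: Is not JSON serializable") is to decode the bytes object returned by newspaper3k into a str before passing it to translate(). The client sends the request payload as JSON, and Python's json module cannot serialize bytes. The page declares charset="UTF-8", so adding .decode("utf-8") does the job: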

html_english = translate_client.translate(
    html_arabic.decode("utf-8"),
    target_language='en', format_='html')
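To see why the decode matters without calling Google at all, here is a minimal sketch using only the standard library. The sample HTML and the to_json_payload helper are made up for illustration; the helper only imitates a client that JSON-encodes its request body and is not the library's actual code.

```python
import json

# Hypothetical stand-in for the bytes that newspaper3k returned (UTF-8 encoded).
html_arabic = '<p lang="ar">مرحبا</p>'.encode("utf-8")


def to_json_payload(text):
    # Hypothetical helper: imitates a client that JSON-encodes its request body.
    return json.dumps({"q": text, "target": "en", "format": "html"})


# Passing bytes fails, because json.dumps has no encoding for bytes objects.
try:
    to_json_payload(html_arabic)
    bytes_failed = False
except TypeError as exc:
    bytes_failed = True
    error_message = str(exc)

# Decoding to str first makes the same payload serializable.
payload = to_json_payload(html_arabic.decode("utf-8"))
round_tripped = json.loads(payload)["q"]
```

If the page were not actually UTF-8, .decode("utf-8") could raise UnicodeDecodeError, so it is worth checking the charset the page declares before hard-coding the encoding.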
Lars Ericson