Write Spark Dataframe in PostgreSQL with UTF-8 encoding

Question

I have a Spark DataFrame that must be saved to PostgreSQL. I think my Python statement is correct except for the encoding options, since I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 95: ordinal not in range(128)

My current statement is:

df.write.jdbc(url=jdbc_url, table='{}.{}'.format(schema_name, table_name), mode='overwrite', properties=properties)

It seems that by default PySpark tries to encode the DataFrame as ASCII, so I should specify the correct encoding (UTF-8). How do I do that?

I've tried option("charset", "utf-8"), option("encoding", "utf-8") and many other combinations I've seen on the Internet. I've also tried adding "client_encoding":"utf8" to the properties passed to jdbc, but nothing seems to work.

Any help would be really appreciated.

Additional info:

Python 2.7
Spark 1.6.2

EDIT 1

My database is UTF-8 encoded:

$ sudo -u postgres psql db_test -c 'SHOW SERVER_ENCODING'
 server_encoding 
-----------------
 UTF8
(1 row)

EDIT 2

I noticed that another error was hidden in the logs together with this one: the PostgreSQL driver was complaining that the table I wanted to create already existed! I dropped the table from PostgreSQL and everything worked like a charm :) Unfortunately, I was not able to fully understand how one thing was related to the other... Maybe the existing table used ASCII encoding and was somehow incompatible with the data to be saved?
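
One possible link between the two errors, offered as an assumption rather than a confirmed diagnosis: in Python 2, converting a unicode string that contains a non-ASCII character to a byte string uses the ASCII codec implicitly, so an error message that contains such a character (u'\xf3' is "ó") may itself trigger a UnicodeEncodeError while it is being reported. The sketch below reproduces only that encoding step. The message text is hypothetical, since the real driver message is not shown above, and to_ascii_bytes and to_utf8_bytes are illustrative helpers, not part of PySpark.

```python
# Hypothetical error text containing u'\xf3'; the real message from the
# logs is not shown in the question, so this string is an assumption.
message = u"la relaci\xf3n ya existe"


def to_ascii_bytes(text):
    """Encode with the ASCII codec, as Python 2 does implicitly for str(unicode)."""
    return text.encode("ascii")


def to_utf8_bytes(text):
    """Encode explicitly as UTF-8, which can represent the character."""
    return text.encode("utf-8")


try:
    to_ascii_bytes(message)
    ascii_error = None
except UnicodeEncodeError as exc:
    # Same error family as the one reported in the question.
    ascii_error = exc

print(repr(ascii_error))
print(repr(to_utf8_bytes(message)))
```

If this is what happened, the encoding error would be a side effect of reporting the "table already exists" failure, not a problem with the DataFrame contents or the database encoding.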

Does this post give any hint? https://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20 – vatsal mevada Nov 28 '17 at 07:20
I've added a second edit. It explains that the issue was fixed, but I still do not know how :) – frb Nov 30 '17 at 08:35

score -1 · Answer 1 · answered Nov 28 '17 at 07:12

You should check the encoding of your PostgreSQL database:

psql my_database -c 'SHOW SERVER_ENCODING'

If that is not a multibyte encoding, then maybe you need to change it to one. See this thread for changing DB encoding:

This official documentation might also be helpful: https://www.postgresql.org/docs/9.3/static/multibyte.html
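
If you want to run this check from a script, here is a minimal sketch that reads the value out of psql's default aligned output. The sample text is the output shown in the question's EDIT 1, and parse_server_encoding is a hypothetical helper, not part of psql or any library.

```python
def parse_server_encoding(psql_output):
    """Return the value row from psql's aligned output of SHOW SERVER_ENCODING."""
    # Drop blank lines and surrounding whitespace.
    lines = [line.strip() for line in psql_output.splitlines() if line.strip()]
    # Expected layout: header, dashed separator, value, "(1 row)" footer.
    for index, line in enumerate(lines):
        if set(line) == {"-"} and index + 1 < len(lines):
            return lines[index + 1]
    raise ValueError("no value row found in psql output")


# Sample copied from the question's EDIT 1.
sample = """ server_encoding 
-----------------
 UTF8
(1 row)
"""

encoding = parse_server_encoding(sample)
print(encoding)
```

A result of UTF8 means the database already uses a multibyte encoding, so changing it would not be needed.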

answered Nov 28 '17 at 07:12

vatsal mevada

Thanks for answering. The encoding of the database is UTF-8 (I've edited my question with the result of the command). – frb Nov 28 '17 at 07:16
