
I am copying CSV files from one machine to a remote PostgreSQL 11 server, using a Python 3.6.4 script.

The CSV files have headers (and values) that can have spaces as well as non-ASCII characters, like ∂ and µ. To save the CSV files, I use the following code:

with open(filename, "w", encoding="utf-8") as output:
    result.to_csv(output, sep='\t', header=False, index=False, na_rep='null', encoding="utf-8")

Here result is a pandas DataFrame. Then I build the list of double-quoted column names:

columns = ["\"{}\"".format(f) for f in list(result.head(0))]

The CSV file and the columns list are then used to copy the data over to Postgres:

    tn = f"{schema}.{tablename} ({', '.join(columns)})"
    with open(filename) as f:
        subprocess.check_call([
            'psql',
            '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null'",
            '-d', url,
            '--set=ON_ERROR_STOP=true'
        ], stdin=f)

This works beautifully on macOS (Catalina 10.15.x). But when the same code runs on an Ubuntu instance (either 18.04 or 16.04), I keep getting the following error:

UnicodeEncodeError: 'ascii' codec can't encode character '\xb5' in position 231: ordinal not in range(128)
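
For what it's worth, the difference between the two machines is easy to see in isolation: Ubuntu server instances often run under a POSIX/C locale, and on Python 3.6 that makes ASCII the preferred (and filesystem) encoding, which subprocess uses when encoding the argument list. A minimal sketch, assuming LANG=C on the Ubuntu side:

import locale

# On macOS this prints 'UTF-8'; under LANG=C on Ubuntu, Python 3.6
# reports 'ANSI_X3.4-1968' (i.e. ASCII).
print(locale.getpreferredencoding())

# With an ASCII locale, encoding 'µ' fails exactly as above:
'µ'.encode(locale.getpreferredencoding())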

The error itself is well documented and I went through all of the posted answers, including here, here and here. However, none of them help. I have sprinkled the code with encoding="utf-8", as you can see, and I have tried defining all the relevant environment variables on the Ubuntu instances, to no avail.
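
For completeness, the settings I tried were along these lines (the exact values here are illustrative). Note that the locale variables have to be in place before the Python interpreter starts; exporting them from inside a running script does not change the encodings Python picked at startup, it only affects child processes:

import os

# Illustrative values only.
os.environ['PGCLIENTENCODING'] = 'UTF8'   # client_encoding seen by psql
os.environ['LANG'] = 'en_US.UTF-8'
os.environ['LC_ALL'] = 'en_US.UTF-8'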

I am not in a position to remove the special characters; they must be preserved in the database. What can I do, either in my Python script or in the subprocess.check_call, to fix this issue? I also switched to the following:

import codecs

with codecs.open(filename, encoding='utf-8') as f:
    subprocess.run([
        'psql',
        '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null'",
        '-d', url,
        '--set=ON_ERROR_STOP=true'
    ], stdin=f, encoding='utf-8')

But the issue remains the same. Any help is appreciated.

  • Have you made sure that the [default client encoding](https://stackoverflow.com/questions/36922248/how-do-i-change-the-default-client-encoding-in-postgres) is set to UTF8 in PostgreSQL on both systems? – MrBean Bremen May 11 '20 at 04:53
  • Yes, but somehow it is not taking effect. – LNI May 11 '20 at 06:51
  • So, you did set `PGCLIENTENCODING` as described in that post? Though that probably isn't your problem, as you don't use a command line client... – MrBean Bremen May 11 '20 at 06:58
  • Yes, I tried PGCLIENTENCODING and all of the rest of the environment variables, to no avail. The issue seems to be with the COPY tn part, as the first table for which this fails has a column with a non-ASCII character in it. – LNI May 11 '20 at 16:17

1 Answer


This is how I have made it work. I'd still prefer a better method, but I am now escaping the encoding issue (pun heavily intended). This method does not support non-ASCII characters in table names, but since that is not in my current requirements, I am not too worried about it. It correctly handles non-ASCII characters in both column names and field values.

The UnicodeEncodeError was being thrown at this line, because one of the column names contained a non-ASCII character:

tn = f"{schema}.{tablename} ({', '.join(columns)})"

So now I keep the column names in the CSV file as well. Each of the steps below is critical; none of them can be skipped:

# First, re-arrange the dataframe columns to match the order in which they
# are defined in the database. The dictionary table_headers contains the
# columns as defined within the database for the specified table.
for entry in table_headers[table]:
    if entry not in result:
        result[entry] = 'null'
result = result[table_headers[table]]
# Set index. You may or may not wish to do so. In this case it made sense for me.
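# result_id (defined elsewhere in the script) names the column to index on.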
result.set_index(result_id, inplace=True)
# Finally, write it out. Note that index=True is being set because of line above.
with open(fn, "w", encoding="utf-8") as output:
    result.to_csv(output, sep='\t', header=True, index=True, na_rep='null', encoding="utf-8")
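
As a quick, purely illustrative sanity check, the header row (non-ASCII names and all) now travels inside the UTF-8 file instead of on the psql command line:

with open(fn, encoding="utf-8") as check:
    # The first line is the tab-separated header row.
    print(check.readline().rstrip('\n'))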

Now, in the database copying stage I do this:

tn = f"{schema}.\"{tablename}\""
with codecs.open(filename, encoding='utf-8') as f:
    subprocess.run([
        'psql',
        '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null' CSV HEADER DELIMITER '\t'",
        '-d', url,
        '--set=ON_ERROR_STOP=true'
    ], stdin=f, encoding='utf-8')
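
One detail worth noting: the '\t' inside that f-string is expanded by Python, so psql receives a literal tab between the quotes. If you prefer to keep the tab visible in the SQL itself, PostgreSQL's option-list COPY syntax (available since 9.0) is an equivalent way to write the same command; this is just an alternative formulation, not something the code above depends on:

copy_sql = f"COPY {tn} FROM STDIN WITH (FORMAT csv, HEADER, DELIMITER E'\\t', NULL 'null')"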

And it works beautifully, on macOS and on Ubuntu. I probably don't need the codecs.open and the other utf-8 entries, but they are not harmful, so I am leaving them in for now.
