I am copying CSV files from one machine to a remote PostgreSQL 11 server, using a Python 3.6.4 script.
The CSV files have headers (and values) that can have spaces as well as non-ASCII characters, like ∂ and µ. To save the CSV files, I use the following code:
with open(filename, "w", encoding="utf-8") as output:
    result.to_csv(output, sep='\t', header=False, index=False, na_rep='null', encoding="utf-8")
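As a sanity check, one can read the file back to confirm it really is valid UTF-8 on disk; a minimal snippet:

# Raises UnicodeDecodeError if the written file is not valid UTF-8
with open(filename, "rb") as check:
    check.read().decode("utf-8")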
Here, result is a pandas DataFrame. Then I get the list of column names:
# Double-quote each header so psql treats it as an identifier (preserving spaces and case)
columns = ["\"{}\"".format(f) for f in list(result.head(0))]
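For illustration, with hypothetical headers containing a space and a µ, this produces double-quoted identifiers:

import pandas as pd

# Hypothetical headers, for illustration only
result = pd.DataFrame(columns=["flow rate", "viscosity µ"])
columns = ["\"{}\"".format(f) for f in list(result.head(0))]
print(columns)  # ['"flow rate"', '"viscosity µ"']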
The CSV and the column list are then used to COPY the data over to Postgres:
tn = f"{schema}.{tablename} ({', '.join(columns)})"
with open(filename) as f:
    subprocess.check_call([
        'psql',
        '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null'",
        '-d', url,
        '--set=ON_ERROR_STOP=true'
    ], stdin=f)
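For concreteness, with hypothetical values for schema, tablename, and url, the command list being executed expands to something like:

# Hypothetical expansion of the command list above, for illustration only
['psql',
 '-c', 'COPY public.measurements ("flow rate", "viscosity µ") '
       "FROM STDIN WITH NULL as 'null'",
 '-d', 'postgresql://user@dbhost:5432/mydb',
 '--set=ON_ERROR_STOP=true']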
This works beautifully on macOS (Catalina, 10.15.x). But when the same code is run on an Ubuntu instance (either 18.04 or 16.04), I keep getting the following error:
UnicodeEncodeError: 'ascii' codec can't encode character '\xb5' in position 231: ordinal not in range(128)
(Note that \xb5 is U+00B5, the micro sign µ, so it is clearly one of the special characters tripping the ASCII codec.) The error itself is well documented, and I went through all of the posted answers, including here, here, and here, but none of them helped. I have sprinkled the code with encoding="utf-8" as you can see, and I have tried defining all of the locale-related environment variables on the Ubuntu instances, to no avail.
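For example, one such attempt forced a UTF-8 locale onto the child process explicitly; a rough sketch (the exact variable set shown here is illustrative):

import os
import subprocess

# Sketch: hand the psql child process an explicit UTF-8 locale.
# The exact variables (LANG, LC_ALL, PYTHONIOENCODING) are illustrative.
env = dict(os.environ, LANG="en_US.UTF-8", LC_ALL="en_US.UTF-8",
           PYTHONIOENCODING="utf-8")

with open(filename, encoding="utf-8") as f:
    subprocess.check_call([
        'psql',
        '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null'",
        '-d', url,
        '--set=ON_ERROR_STOP=true'
    ], stdin=f, env=env)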
I am not in a position to remove the special characters; they must be preserved in the database. What can I do, either in my Python script or in the subprocess.check_call, to fix this issue? I also tried switching to the following:
import codecs

with codecs.open(filename, encoding='utf-8') as f:
    subprocess.run([
        'psql',
        '-c', f"COPY {tn} FROM STDIN WITH NULL as 'null'",
        '-d', url,
        '--set=ON_ERROR_STOP=true'
    ], stdin=f, encoding='utf-8')
But the issue remains the same. Any help is appreciated.