
I have been importing some data from MySQL to Postgres. The plan should have been simple: manually re-create the tables with their equivalent data types, devise a way to output as CSV, transfer the data over, and copy it into Postgres. Done.

mysql -u whatever -p -D the_database

SELECT * INTO OUTFILE '/tmp/the_table.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' FROM the_table;

Then send the file across and import it into Postgres:

psql -etcetc -d other_database

COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' );

It had been too long; I had forgotten that '0000-00-00' was a thing... So first of all I had to come up with some way of dealing with the weird values, preferably at the MySQL end, and wrote this script for the 20 or so tables I planned to import, to address any incompatibilities and list out the columns accordingly:

-- run against the re-created table in Postgres: builds the MySQL
-- "SELECT ... INTO OUTFILE" statement for one table, wrapping date and
-- timestamp columns in NULLIF so that zero dates become NULLs
with a as (
    select
        'the_table'::text as tblname,
        'public'::text as schname
), b as (
    select array_to_string( array_agg( x.column_name ), ',' ) as the_cols from (
        select
            case
                when udt_name = 'timestamp'
                then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
                when udt_name = 'date'
                then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
                else column_name::text
            end as column_name
        from information_schema.columns, a
        where table_schema = a.schname
        and table_name = a.tblname
        order by ordinal_position
    ) x
)
select 'SELECT '|| b.the_cols ||' INTO OUTFILE ''/tmp/'|| a.tblname ||'.csv'' FIELDS TERMINATED BY '','' OPTIONALLY ENCLOSED BY ''"'' ESCAPED BY ''\\'' FROM '|| a.tblname ||';' from a,b;
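
To give an idea of the output: for a hypothetical table some_table with an int id, a timestamp created_at, a date dob and a varchar name (names made up purely for illustration), the generated statement comes out as a one-liner along these lines:

SELECT id,NULLIF(created_at,'0000-00-00 00:00:00'),NULLIF(dob,'0000-00-00'),name INTO OUTFILE '/tmp/some_table.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' FROM some_table;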

Generate the CSV: fine. Transfer it across: fine. Once it was over on the Postgres side...

BEGIN;
ALTER TABLE the_table SET( autovacuum_enabled = false, toast.autovacuum_enabled = false );
COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' ); -- '
ALTER TABLE the_table SET( autovacuum_enabled = true, toast.autovacuum_enabled = true );
COMMIT;

and it was all going well, until I came across this message:

ERROR:  invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd
CONTEXT:  COPY new_table, line 12345678

A second table hit the same error, but every other one imported successfully. All columns and tables in the MySQL db were set to utf8, and the first offending table, containing messages, was along the lines of:

CREATE TABLE whatever(
col1 int(11) NOT NULL AUTO_INCREMENT,
col2 date,
col3 int(11),
col4 int(11),
col5 int(11),
col6 int(11),
col7 varchar(64),
PRIMARY KEY(col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
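
A Postgres-side equivalent of that definition would be roughly the following (a sketch with guessed type mappings, not necessarily the exact DDL I used):

CREATE TABLE whatever(
    col1 serial PRIMARY KEY,  -- int(11) AUTO_INCREMENT
    col2 date,
    col3 integer,
    col4 integer,
    col5 integer,
    col6 integer,
    col7 varchar(64)
);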

So presumably the data should be utf... right? To make sure there were no major misconfigurations, I edited my.cnf to set everything I could think of relating to the encoding:

[client]
default-character-set=utf8

[mysql]
default-character-set=utf8

[mysqld]
character-set-server = utf8
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
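
Whether those settings actually take effect can be double checked from a client session with the standard variable check:

SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';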

I also altered the case statement in my initial "query generating query" to run the text columns through CONVERT, just for the sake of converting:

        case
            when udt_name = 'timestamp'
            then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
            when udt_name = 'date'
            then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
            when udt_name = 'text'
            then 'CONVERT('|| column_name::text || ' USING utf8)'
            else column_name::text
        end as column_name

Still no luck. After googling "0xed 0xa0 0xbd" I am none the wiser; character sets are not really my thing. I even opened the 3 GB CSV file at the line it mentioned and nothing appeared out of place, and looking with a hex editor I could not see those byte values (edit: maybe I didn't look hard enough), so I am starting to run out of ideas. Am I missing something really simple? And, worryingly, is it possible that some of the other tables have been more "silently" corrupted too?
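
One thing I have not actually tried yet is hunting for the raw byte sequence at the MySQL end, something along these lines (an untested sketch against the offending table's varchar column):

SELECT col1, HEX(col7)
FROM whatever
WHERE HEX(col7) LIKE '%EDA0BD%';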

The MySQL version is 5.5.44 on Ubuntu 14.04, and the Postgres version is 9.4.

Lucas
  • By the look of [the table in this answer](http://stackoverflow.com/a/6555104/1411457), 0xed 0xa0 0xbd is definitely invalid UTF8. But I can't see how you can be getting this error if the file does not contain that byte sequence somewhere. – harmic Aug 27 '15 at 03:01
  • That sequence encodes the code-point `U+d83d`. It's a structurally valid sequence but it encodes an invalid character. http://www.charbase.com/d83d-unicode-invalid-character . At a guess, MySQL's validation is more lax than PostgreSQL's, so MySQL allowed it and PostgreSQL rejects it. – Craig Ringer Aug 27 '15 at 05:04
  • Search for "unicode eda0bd" -- it seems to be valid, but unassigned: http://dev.networkerror.org/utf8/?start=55335&end=55590&cols=4&show_uni_int=on&show_uni_hex=on&show_html_ent=on&show_raw_hex=on&show_raw_bin=on – Rick James Aug 28 '15 at 17:02

1 Answer


Without anything further to try, I went for the simplest solution: just alter the files.

iconv -f utf-8 -t utf-8 -c the_file.csv > the_file_iconv.csv

There was about a 100-byte difference between the new files and the originals, so there must have been invalid bytes in there somewhere that I could not see. They imported "properly", so I suppose that is good, but it would be nice to know if there were some way to enforce proper encoding when creating the files, rather than discovering the problem on import.
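
As a rough sanity check that -c only stripped stray bytes rather than whole rows, it is probably worth comparing the row count that COPY reports ("COPY n") against a count on the MySQL side, nothing clever:

-- run on the MySQL source, then compare with the "COPY n" reported by Postgres
SELECT count(*) FROM the_table;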

Lucas