I have been importing some data from MySQL to Postgres, the plan should have been simple- manually re-create the tables with their equivalent data types, divise a way to output as CSV, transfer over the data, copy it into Postgres. Done.
mysql -u whatever -p whatever -d the_database
SELECT * INTO OUTFILE '/tmp/the_table.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' FROM the_table;
send and import to postgres
psql -etcetc -d other_database
COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' );
It had been too long, I had forgotten that '0000-00-00' was a thing... so first of all I had to come up with some way of addressing weird data types, preferably at the MySQL end and so wrote this script for the 20 or so tables I planned to import to address any imcompatabilities and list out the columns accordingly
with a as (
select
'the_table'::text as tblname,
'public'::text as schname
), b as (
select array_to_string( array_agg( x.column_name ), ',' ) as the_cols from (
select
case
when udt_name = 'timestamp'
then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
when udt_name = 'date'
then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
else column_name::text
end as column_name
from information_schema.columns, a
where table_schema = a.schname
and table_name = a.tblname
order by ordinal_position
) x
)
select 'SELECT '|| b.the_cols ||' INTO OUTFILE ''/tmp/'|| a.tblname ||'.csv'' FIELDS TERMINATED BY '','' OPTIONALLY ENCLOSED BY ''"'' ESCAPED BY ''\\'' FROM '|| a.tblname ||';' from a,b;
Generate CSV, ok. Transfer across, ok - Once over...
BEGIN;
ALTER TABLE the_table SET( autovacuum_enabled = false, toast.autovacuum_enabled = false );
COPY the_table FROM '/csv/file/location/the_table.csv' WITH( FORMAT CSV, DELIMITER ',', QUOTE '"', ESCAPE '\', NULL '\N' ); -- '
ALTER TABLE the_table SET( autovacuum_enabled = true, toast.autovacuum_enabled = true );
COMMIT;
and it was all going well, until I came across this message:
ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xbd
CONTEXT: COPY new_table, line 12345678
a second table also encountered the same error however every other one imported successfully. Now all columns and tables in the MySQL db were set to utf8, the first offending table containing messages was along the lines of
CREATE TABLE whatever(
col1 int(11) NOT NULL AUTO_INCREMENT,
col2 date,
col3 int(11),
col4 int(11),
col5 int(11),
col6 int(11),
col7 varchar(64),
PRIMARY KEY(col1)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
So presumably the data should be utf... right? to make sure there were no major errors I edited the my.cnf to ensure everything I could think of to include the encoding
[character sets]
default-character-set=utf8
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_unicode_ci
init-connect='SET NAMES utf8'
I altered my initial "query generating query" case statement to convert columns for the sake of converting
case
when udt_name = 'timestamp'
then 'NULLIF('|| column_name::text || ',''0000-00-00 00:00:00'')'
when udt_name = 'date'
then 'NULLIF('|| column_name::text || ',''0000-00-00'')'
when udt_name = 'text'
then 'CONVERT('|| column_name::text || ' USING utf8)'
else column_name::text
end as column_name
and still no luck. After googling "0xed 0xa0 0xbd" I am still none the wiser, character sets are not really my thing. I even opened the 3 gig csv file to the line it mentioned and there didn't appear to be anything out of place, looking with a hex editor I could not see those byte values (edit: maybe I didn't look hard enough) so I am starting to run out of ideas. Am I missing something really simple, and worryingly, is it possible that some of the other tables may have been more "silently" corrupted too?
The MySQL version is 5.5.44 on a ubuntu 14.04 operating system and the Postgres is 9.4