
When trying to import CSVs of exported MySQL data from S3 into Redshift with the COPY command, I'm running into the `String length exceeds DDL length` error.
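Roughly, the COPY command looks like this (the table, bucket, and IAM role names here are placeholders, not the real ones):

```sql
-- Sketch of the COPY being run; identifiers are placeholders
COPY my_schema.my_table
FROM 's3://my-bucket/mysql-export/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV
DELIMITER ','
IGNOREHEADER 1;
```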

The table is set up with the same VARCHAR lengths as the MySQL database, but for strings close to the maximum length it fails, because special characters are converted to double dots, which increases the length.
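The DDL mirrors the MySQL columns; something along these lines (column names and lengths are only illustrative):

```sql
-- Illustrative DDL; the real table has more columns
CREATE TABLE my_schema.my_table (
    id   BIGINT,
    name VARCHAR(255)  -- same declared length as the MySQL column
);
```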

For example, the error log shows it trying to insert `Rechtsanw..lte`, when in the CSV it's `Rechtsanwälte`.

Shorter strings with special characters work, and the characters show up correctly once inserted into the table.

I've tried running the command with TRUNCATECOLUMNS, which allows the operation to complete, but the strings get truncated and end up shorter than the maximum length once the double dots are converted back into special characters.
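That is, the same COPY as above with the extra option added (again a sketch with placeholder names):

```sql
-- TRUNCATECOLUMNS cuts over-length values instead of failing the load
COPY my_schema.my_table
FROM 's3://my-bucket/mysql-export/data.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV
DELIMITER ','
IGNOREHEADER 1
TRUNCATECOLUMNS;
```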

Any ideas on why this is happening, and what I can do to solve it?

Boivie
  • `varchar` means ASCII. These aren't special characters though, they are perfectly normal characters which, in UTF8 would appear as 2 bytes. You have a codepage conversion error, not a column length error. Unless your column's encoding is UTF8 you'll lose data. – Panagiotis Kanavos Oct 19 '18 at 12:22
  • @PanagiotisKanavos: `varchar` does not necessarily mean "ASCII" in Postgres – a_horse_with_no_name Oct 19 '18 at 12:25
  • @a_horse_with_no_name how is it wrong? That's an obvious conversion of a *single* character `ä` to *two* unrepresentable bytes – Panagiotis Kanavos Oct 19 '18 at 12:25
  • @a_horse_with_no_name perhaps you should read the comment again? If it's not UTF8, there's data loss with those exact symptoms - UTF8 characters outside the ANSI range end up appearing as multiple bytes – Panagiotis Kanavos Oct 19 '18 at 12:26
  • Thanks for the duplicate link, seems to answer my question. Still a bit confused as to how the data fits in MySQL `varchar` of the same length but not in redshift? – Boivie Oct 19 '18 at 12:34
  • @PanagiotisKanavos: but the claim that `varchar` "is ASCII only" is not true. A `varchar` column can absolutely store UTF-8 characters in Postgres if the database was initialized correctly and the characters are sent correctly to the database. – a_horse_with_no_name Oct 19 '18 at 12:39
  • @a_horse_with_no_name ASCII, ANSI or `single byte codepage`. Apart from the verbose explanation different people use a different acronym to mean "string with codepage". Pick which one you want in this thread, I'll use it – Panagiotis Kanavos Oct 19 '18 at 12:41
  • @Boivie different databases, different rules. In fact, before finding the link for Redshift I found another one that explained that *older* MySQL versions counted *bytes* while later versions counted *characters*. Redshift is based on an older PostgreSQL version too. – Panagiotis Kanavos Oct 19 '18 at 12:43
  • @a_horse_with_no_name btw Boivie's comment reminded me that Redshift!=PostgreSQL. It was built on top of an *old* PostgreSQL version and has *many* quirks and unsupported keywords. Postgres [counts *characters*](https://stackoverflow.com/questions/4249745/does-postgresql-varchar-count-using-unicode-character-length-or-ascii-character) not bytes. The duplicate link shows that *Redshift* counts *bytes* instead – Panagiotis Kanavos Oct 19 '18 at 12:45
  • @PanagiotisKanavos: still: a `varchar` column _can_ store multi-byte characters without problems. – a_horse_with_no_name Oct 19 '18 at 12:46

0 Answers