I am trying to load a 3 GB CSV file (24 million rows) into a Greenplum database using gpload, but I keep getting the error below.

Error -

 invalid byte sequence for encoding "UTF8": 0x8d

I have tried the solution provided by Mike, but in my case client_encoding and the file encoding are already the same: both are UNICODE.

Database -

show client_encoding;
"UNICODE"

File -

file my_file_name.csv
my_file_name.csv: UTF-8 Unicode (with BOM) text

I have also gone through Greenplum's documentation, which says the encoding of the external file and the database should match. They match in my case, yet the load still fails.

I have loaded similar, smaller files as well (also reported as UTF-8 Unicode (with BOM) text).

Any help is appreciated!

Pirate X

1 Answer

As posted in another thread: use the iconv command to strip these characters out of your file. A Greenplum database is created with a character set (UTF-8 by default) and requires every character loaded into it to be valid in that character set. Alternatively, you can log these errors with the LOG ERRORS clause of the external table definition. This traps the bad rows and lets the load continue, up to the SEGMENT REJECT LIMIT that you specify when creating the table.

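iconv -f utf-8 -t utf-8 -c file.txt > file_clean.txt

will write a cleaned-up copy of your UTF-8 file to file_clean.txt (a name chosen here for illustration), skipping all the invalid characters. Without the redirect, iconv prints the result to standard output and leaves file.txt unchanged.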

-f is the source encoding
-t is the target encoding
-c drops any invalid sequence from the output
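
If iconv is not available, roughly the same cleanup can be sketched in Python. This is only an illustration of the idea behind `iconv -c`, not a drop-in replacement: the function name `strip_invalid_utf8` and the chunk size are assumptions made for this example, and edge-case behaviour may differ from iconv. It streams the file in chunks, so a 3 GB input does not have to fit in memory, and it returns the number of bytes it dropped so you can tell whether the file really contained invalid sequences.

```python
import codecs


def strip_invalid_utf8(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping bytes that are not valid UTF-8.

    Returns the number of bytes that were dropped.
    (Hypothetical helper written for this answer, not part of any library.)
    """
    # An incremental decoder keeps partial multi-byte sequences between
    # chunks, so a character split across a chunk boundary is not lost.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    bytes_read = 0
    bytes_written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            final = not chunk  # an empty read means end of file
            text = decoder.decode(chunk, final=final)
            out = text.encode("utf-8")
            dst.write(out)
            bytes_read += len(chunk)
            bytes_written += len(out)
            if final:
                break
    return bytes_read - bytes_written
```

Calling `strip_invalid_utf8("file.txt", "file_clean.txt")` (both file names are placeholders) and getting a non-zero result would confirm that the file holds stray bytes such as 0x8d, even though it is otherwise UTF-8.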
reisdev
  • So I tried converting the file using 'iconv -c'. I think some of the characters even though they were UTF-8 were messing things up – Pirate X Apr 18 '19 at 17:34