How To Convert Latin-1 To UTF-8 With Elixir?

Question

Elixir 1.3.0

Windows 10

Postgrex 0.11.2

Ecto 2.0.1

Postgres 9.4.4

I'm attempting to add records to a PostgreSQL database via Ecto. When I get to a string containing \x0087 it throws the following error:

** (Postgrex.Error) ERROR (character_not_in_repertoire): invalid byte sequence for encoding "UTF8": 0x87

I'm pretty sure the problem is the file itself, which as far as I can tell is encoded as Latin-1. This is the code I use to open the file and read it in:

:ok = :io.setopts(:standard_io, encoding: :latin1)
File.open!(file)
|> IO.binstream(:line)

The file opens fine and in fact several lines are processed just fine until it gets to a line that contains \x0087.

What I can't quite figure out is how to convert the line which is read in with latin1 encoding into UTF-8 encoding. I found String.normalize which seems like it might help with the conversion but I know I'm grasping at straws.

I changed the encoding: parameter on the :io.setopts line to :utf8 but it doesn't seem to make a difference.

Is there some simple way to convert an ANSI/Latin1 encoded string to a UTF-8 encoded string?

I don't think the byte 0x87 is valid in latin1: https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout — Dogbert, Jun 24 '16 at 20:49
It may not be. The file isn't actually Latin 1. It's actually Windows-1252. — Onorio Catenacci, Jun 25 '16 at 16:02

Answer (score 0, edited May 23 '17 at 12:08)

I'm really hesitant to answer my own question, but I think the techniques found in this Q & A are the right answer here as well. Basically, the data needs to be converted from CP-1252 (Windows-1252) to UTF-8, and then everything works as expected.
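For anyone landing here, the conversion described above can be sketched roughly as follows. This is only a minimal illustration, not the code from the linked Q & A: the module name Cp1252 and the function to_utf8/1 are made up for this example. It relies on the fact that Windows-1252 matches Latin-1 everywhere except bytes 0x80..0x9F, so only those bytes need a lookup table. The five bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are passed through as the corresponding C1 control code points, which is an assumption about what you want to happen with them.

```elixir
defmodule Cp1252 do
  # Hypothetical helper module, not part of Elixir or any library.
  # Maps the Windows-1252 bytes in 0x80..0x9F that differ from
  # Latin-1 to their Unicode code points.
  @high %{
    0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
    0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
    0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
    0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
    0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
    0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
    0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178
  }

  # Converts a Windows-1252 encoded binary into a UTF-8 binary.
  def to_utf8(binary) when is_binary(binary) do
    for <<byte <- binary>>, into: "" do
      # Bytes not in the table have the same code point as in Latin-1.
      codepoint = Map.get(@high, byte, byte)
      <<codepoint::utf8>>
    end
  end
end

# The byte from the question, 0x87, becomes U+2021 (double dagger):
# Cp1252.to_utf8(<<0x87>>) == <<0x2021::utf8>>
```

With something like this in place, each raw line read from the file can be passed through Cp1252.to_utf8/1 before it is handed to Ecto. If the input really were ISO 8859-1 rather than Windows-1252, Erlang's built-in :unicode.characters_to_binary(binary, :latin1, :utf8) should be enough and no table is needed.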

answered Jun 27 '16 at 12:38

Onorio Catenacci

