
Background

I've got data in a Postgres DB that has been incorrectly encoded at some point.

The DB is UTF-8 encoded. The table in question has a column which contains YAML-serialized data. In some rows, non-ASCII characters seem to be represented by the escaped bytes of their two-byte UTF-8 encoding. It's easier to show:

> puts data
#  ---
#  :method_name: new
#  :method_args:
#  - "M\xC3\xB6bler"
#  - ""
#  - false
#  - ""
#  - test
#  - f8685480-a36b-012f-54c1-1093e95ec0bb

> data.encoding
# => #<Encoding:UTF-8>

The \xC3\xB6 should be the character ö.

You can get a similar-looking result by forcing a Unicode string to binary:

> string = "ö".force_encoding("ascii-8bit")
# => "\xC3\xB6"

In this case, however, the original bytes are retained, so we can convert back to UTF-8:

> string.force_encoding("utf-8")
# => "ö"

Here, \xC3\xB6 seems to be just Ruby's way of displaying bytes that are not printable in ASCII-8BIT. You can see that each one is a single byte by calling .chars:

> string.chars
# => ["\xC3", "\xB6"]

But in the strings that come from the database, \xC3\xB6 is actually eight literal characters:

> data[42..49].chars
# => ["\\", "x", "C", "3", "\\", "x", "B", "6"]

Because of this, you can't just force the encoding to ASCII-8BIT and back again, which was my first attempt at a solution.
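A minimal sketch of why that round trip does nothing here: force_encoding only re-tags the existing bytes, and the escaped text is already plain ASCII. The helper name retag_round_trip is made up for this illustration.

```ruby
# Hypothetical helper: re-tag a string as binary and back, as in the first attempt.
def retag_round_trip(str)
  str.dup.force_encoding("ASCII-8BIT").force_encoding("UTF-8")
end

escaped = "M\\xC3\\xB6bler"  # backslash-x text, as it comes out of the database
real    = "Möbler"           # what the value should be

puts retag_round_trip(escaped) == escaped  # true: the round trip changes nothing
puts escaped.ascii_only?                   # true: every character is plain ASCII
puts escaped.bytesize                      # 13: each \xNN takes four bytes
puts real.bytesize                         # 7: ö is two bytes in UTF-8
```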

My next thought was to restore the original bytes somehow, but this turned out to be harder than I thought.

One possible (hackish) solution was suggested here: Best way to escape and unescape strings in Ruby?

That solution doesn't work for me, probably because the string represents YAML.

Question

How can I restore the original Unicode characters?

I guess I could write a ginormous gsub-expression, but I'd rather avoid that.

Jesper

1 Answer


I guess I could write a ginormous gsub-expression, but I'd rather avoid that.

Not really that ginormous :)

string = "M\\xC3\\xB6bler"
string.encoding
# => #<Encoding:UTF-8>

# Replace each literal \xNN sequence (NN = two hex digits) with the byte it names
puts string.gsub(/\\x([0-9a-fA-F]{2})/) { $1.to_i(16).chr }
# => Möbler
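One caveat: Integer#chr returns a binary (ASCII-8BIT) string for values above 127, so the result of the gsub may not end up tagged as UTF-8, and mixing those bytes into a string that already contains real non-ASCII characters can raise Encoding::CompatibilityError. A slightly more defensive sketch does the substitution on a binary copy and re-tags the result at the end. The helper name unescape_hex_bytes is made up for this illustration, and the sketch assumes every \xNN sequence in the data really is an escaped byte.

```ruby
# Hypothetical helper: turn literal "\xNN" sequences back into raw bytes.
def unescape_hex_bytes(str)
  # Work on a binary copy so raw bytes can be spliced in without encoding conflicts.
  bytes = str.b.gsub(/\\x(\h\h)/) { $1.hex.chr }
  # Reinterpret the resulting byte sequence as UTF-8.
  bytes.force_encoding("UTF-8")
end

fixed = unescape_hex_bytes("M\\xC3\\xB6bler")
puts fixed                  # Möbler
puts fixed.encoding         # UTF-8
puts fixed.valid_encoding?  # true
```

Checking valid_encoding? on each row afterwards is a cheap way to spot values where the escaped bytes did not form valid UTF-8.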
Amadan