Background
I've got data in a Postgres DB that has been incorrectly encoded at some point.
The DB is UTF-8 encoded. The table in question has a column which contains YAML-serialized data. Some rows contain non-ASCII characters that seem to be represented by their two-byte UTF-8 equivalents. It's easier to show:
> puts data
# ---
# :method_name: new
# :method_args:
# - "M\xC3\xB6bler"
# - ""
# - false
# - ""
# - test
# - f8685480-a36b-012f-54c1-1093e95ec0bb
> data.encoding
# => #<Encoding:UTF-8>
The \xC3\xB6 should be the character ö (0xC3 0xB6 is the two-byte UTF-8 encoding of ö).
You can get the same sort of result by doing this with a Unicode string:
> string = "ö".force_encoding("ascii-8bit")
# => "\xC3\xB6"
In this case, however, the original bytes are retained, so we can convert back to UTF-8:
> string.force_encoding("utf-8")
# => "ö"
Printing \xC3\xB6 is just Ruby's way of displaying bytes that make no sense in ASCII-8BIT. You can illustrate this by calling .chars:
> string.chars
# => ["\xC3", "\xB6"]
But in the strings that come from the database, \xC3\xB6 is actually eight characters:
> data[42..49].chars
# => ["\\", "x", "C", "3", "\\", "x", "B", "6"]
Because of this, you can't just force the encoding to ASCII-8BIT and back again. That was my first attempt at a solution.
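Concretely, the round-trip that works for the in-memory string above does nothing here: the backslash, the x and the hex digits are all plain ASCII characters, so they survive both re-taggings unchanged (using dup so the original isn't mutated):
> data.dup.force_encoding("ascii-8bit").force_encoding("utf-8")[42..49]
# => "\\xC3\\xB6"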
My next thought was to restore the original bytes somehow, but this turned out to be harder than I expected.
One possible (hackish) solution was suggested here: Best way to escape and unescape strings in Ruby?
That solution doesn't work for me, probably because the string represents YAML.
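As far as I can tell, the trick suggested there amounts to re-quoting the string and letting the Ruby parser undo the escapes, roughly like this (my paraphrase, not the exact code from that answer):
> eval %Q{"#{data}"}
With a multi-line YAML document that blows up, because the document's own double quotes terminate the string literal early.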
Question
How can I restore the original Unicode characters?
I guess I could write a ginormous gsub expression, but I'd rather avoid that.
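To be clear, what I want to avoid is one substitution per character. A single pattern-based gsub that turns each literal \xNN sequence back into the byte it names, then re-tags the result as UTF-8, seems to work on the example above, but it feels fragile and I haven't tested it against the rest of the data:
> binary = data.dup.force_encoding("ascii-8bit")
> fixed  = binary.gsub(/\\x([0-9A-Fa-f]{2})/) { $1.hex.chr }
> fixed.force_encoding("utf-8")
> fixed.include?("Möbler")
# => true
Is there a cleaner or safer way to do this?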