
I'm running a script on WSL Debian that fetches Windows files from a locally mounted share drive. The issue is that the file names appear wrongly encoded, even though `String#encoding` returns `#<Encoding:UTF-8>`. Example:

"J\u00E9r\u00E9my".encoding  # #<Encoding:UTF-8>

`\u00E9` is the Unicode codepoint for é, so I assume the string really is Unicode.

I've tried several encoding combinations from related questions (Convert a unicode string to characters in Ruby?, How to convert a string to UTF8 in Ruby), but none of them fit my needs. I've also tried different magic comments (`# encoding: <ENCODING>`), without satisfying results.
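
For illustration, the attempted conversions were along these lines (the usual suggestions from those questions); none of them changed the output:

"J\u00E9r\u00E9my".encode('UTF-8')          # no-op: source and target encoding are both UTF-8
"J\u00E9r\u00E9my".force_encoding('UTF-8')  # no-op: the string is already tagged as UTF-8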

What's your methodology to identify and fix encoding issues?


Edit1: Stefan asked for codepoints:
"J\u00E9r\u00E9my".each_codepoint.to_a
# [74, 233, 114, 233, 109, 121]

and for Encoding.default_external:

Encoding.default_external
# #<Encoding:US-ASCII>

This surprises me, as I have the magic comment # encoding: utf-8 at the top of my file.
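
Minimal reproduction (a sketch; hypothetical file name test.rb, run under the same locale):

# encoding: utf-8
# The magic comment applies to the string literals in this file only:
puts "é".encoding               #=> UTF-8
# The process-wide default is unaffected by it:
puts Encoding.default_external  #=> US-ASCII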


Edit2: Explicitly setting the default_internal and default_external encodings to Encoding::UTF_8 fixes the problem:

# encoding: utf-8

Encoding.default_internal = Encoding::UTF_8
Encoding.default_external = Encoding::UTF_8

Though I'd like to go further and actually understand why this is required.

Sumak
    Can you show the string's [`codepoints`](https://ruby-doc.org/core-2.7.2/String.html#codepoints-method)? Also what does [`Encoding.default_external`](https://ruby-doc.org/core-2.7.2/Encoding.html#default_external-method) return? – Stefan Dec 01 '20 at 14:34
  • The encoding comment *in the file* sets the encoding *in the file*. It does not change the encoding of the Windows file system. How would it even do that? – Jörg W Mittag Dec 01 '20 at 16:16
  • `# encoding: utf-8` is just about the encoding of the file, and it is used only (and not even always) by your editor. Compilers may use it, but only for the very first phase: reading the file. – Giacomo Catenazzi Dec 02 '20 at 10:12

1 Answer

"J\u00E9r\u00E9my".encoding
#=> #<Encoding:UTF-8>
"J\u00E9r\u00E9my".each_codepoint.to_a
#=> [74, 233, 114, 233, 109, 121]

The strings are perfectly fine. They contain the correct bytes and have the correct encoding.

They are printed this way because your external encoding is set to (or recognised as) US-ASCII:

Encoding.default_external
#=> #<Encoding:US-ASCII>

Ruby assumes that your terminal can only render ASCII characters and therefore prints UTF-8 characters as escape sequences when using `p` / `String#inspect`.
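
A minimal demonstration (a sketch, run in a terminal that can render UTF-8; default_external is set at runtime here only to reproduce the situation):

Encoding.default_external = Encoding::US_ASCII

name = "Jérémy"
p name     # "J\u00E9r\u00E9my" – inspect escapes the non-ASCII characters
puts name  # Jérémy – the raw UTF-8 bytes are written and rendered as-is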

The external encoding is usually determined automatically based on your locale:

$ LANG=C            ruby -e 'p Encoding.default_external'
#<Encoding:US-ASCII>

$ LANG=en_US.UTF-8  ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>

Setting your terminal's or system's encoding / locale to UTF-8 should fix the problem.
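
On Debian (including WSL) this could look as follows (standard Debian tooling; the exact locale name may differ on your system):

$ locale                          # inspect the current locale settings
$ sudo dpkg-reconfigure locales   # generate and select e.g. en_US.UTF-8
$ export LANG=en_US.UTF-8         # or override it for the current session

$ ruby -e 'p Encoding.default_external'
#<Encoding:UTF-8>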

Stefan
  • For future visitors: note that [String#codepoints](https://ruby-doc.org/core-2.7.2/String.html#codepoints-method) is shorthand for `str.each_codepoint.to_a`. The result will be the same either way. – Todd A. Jacobs Dec 01 '20 at 14:59
  • Indeed, it came from my terminal settings. Despite the fact that WSL's terminal says it's using UTF-8, running the script from another terminal prints the accented characters properly. I'll investigate the WSL settings, thanks for pointing me in the right direction! – Sumak Dec 01 '20 at 15:06