
I'm trying to decode what I think is some quoted-printable encoded text that appears in an MBox email archive. I will give one example of some text I am having trouble with.

In the MBox, the following text appears:

"Demarcation by Theresa Castel=E3o-Lawless"

Properly decoded, I think this should appear as:

"Demarcation by Theresa Castelão-Lawless"

I'm basing that expectation on two things:

1) a web archive of the email, in which the text is rendered as "Demarcation by Theresa Castelão-Lawless"

and 2) this page, which shows "=E3" as corresponding to "ã" in quoted-printable: https://www.ic.unicamp.br/~stolfi/EXPORT/www/ISO-8859-1-Encoding.html

I've tried the code below but it gives the wrong output.


require 'mail' # the mail gem provides Mail::Encodings

string = "Demarcation by Theresa Castel=E3o-Lawless"

decoded_string = Mail::Encodings::QuotedPrintable.decode(string)

puts decoded_string

The result from the code above is "Demarcation by Theresa Castel?o-Lawless", but as stated above, I want "Demarcation by Theresa Castelão-Lawless".

JustinCEO
  • Regarding 2) that page is all about ISO-8859-1 aka ISO Latin 1. In Ruby, strings are UTF-8 by default. – Stefan Jul 16 '19 at 12:23

1 Answer


Try to avoid weird Rails stuff when good old plain Ruby can accomplish the task. String#unpack is your friend.

"Demarcation by Theresa Castel=E3o-Lawless".
  unpack("M").first. # unpack as quoted printable
  force_encoding(Encoding::ISO_8859_1).
  encode(Encoding::UTF_8)
#⇒ "Demarcation by Theresa Castelão-Lawless"

or, as suggested in comments by @Stefan, one can pass the source encoding as the 2nd argument:

"Demarcation by Theresa Castel=E3o-Lawless".
  unpack("M").first. # unpack as quoted printable
  encode('utf-8', 'iso-8859-1')

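Note: unpack("M") only turns the quoted-printable escapes back into raw bytes; it does not know which character set those bytes are in. Ruby therefore has to be told that they are single-byte ISO-8859-1 (Latin-1, which covers Western European accented characters) before it can transcode them to UTF-8. force_encoding does that in the first version, and the second argument to encode does it in the second.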
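To make the two steps concrete, here is a minimal sketch that wraps them in a helper. The method name decode_quoted_printable and its charset parameter are made up for illustration and are not part of Ruby or the mail gem. It also assumes the text really is ISO-8859-1, as in the question; in a real mailbox the charset would normally come from the message's Content-Type header.

```ruby
# Hypothetical helper: the name and the default charset are assumptions
# made for this example, not an existing API.
def decode_quoted_printable(text, charset = Encoding::ISO_8859_1)
  raw = text.unpack1("M")        # step 1: quoted-printable escapes -> raw bytes
  raw.force_encoding(charset)    # step 2: label the bytes with their real charset
     .encode(Encoding::UTF_8)    # step 3: transcode to UTF-8
end

# After unpacking, "=E3" has become the single byte 0xE3,
# which is "ã" in ISO-8859-1 but is not valid UTF-8 on its own.
raw = "Castel=E3o".unpack1("M")
puts raw.bytes.include?(0xE3)    # prints "true"

decoded = decode_quoted_printable("Demarcation by Theresa Castel=E3o-Lawless")
puts decoded                     # prints "Demarcation by Theresa Castelão-Lawless"
```

unpack1 (Ruby 2.4+) is just unpack(...).first. Passing a different charset, for example decode_quoted_printable(text, "windows-1252"), would cover mail that declares another single-byte encoding.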

Aleksei Matiushkin
  • You can pass the source encoding as the 2nd argument: `encode('utf-8', 'iso-8859-1')` – Stefan Jul 16 '19 at 12:20
  • @Stefan yes, I decided that way it’s somewhat more explicit. Maybe I’m wrong, I’ll update the answer. – Aleksei Matiushkin Jul 16 '19 at 12:51
  • @AlekseiMatiushkin I'm unsure how to handle this on SO, but there is another very similar question: https://stackoverflow.com/questions/3473952/is-there-a-way-to-decode-q-encoded-strings-in-ruby/67418207#67418207 Strictly speaking, it is not a duplicate because this question has a string which only contains hexadecimal sequences, without the QP delimiters, charset and code. `unpack` is still a great solution in that case as well; I wrote complete example code around it. A reference to the other Q/A might be helpful to readers here. – Richard Michael May 07 '21 at 10:06