
I have an IMAP mail with a message string that looks like this:

message = #<Mail::Message:70152447148720, Multipart: false, Headers: <Return-Path: <apache@mail.gameseek.co.uk>>, <Received: by 10.86.68.12 with SMTP id q12cs352558fga; Mon, 9 Mar 2009 04:23:05 -0700 (PDT)>, <Received: by 10.210.137.14 with SMTP id k14mr2429643ebd.46.1236597783700; Mon, 09 Mar 2009 04:23:03 -0700 (PDT)>, <Received: from exproxy-2.exserver.dk (exproxy-2.exserver.dk [195.69.129.163]) by mx.google.com with ESMTP id 27si3500694ewy.75.2009.03.09.04.23.03; Mon, 09 Mar 2009 04:23:03 -0700 (PDT)>, <Received: by exproxy-2.exserver.dk (Postfix, from userid 65534) id DF2F6106EF3; Mon, 9 Mar 2009 12:13:26 +0100 (CET)>, <Received: from exsmtp01.exserver.dk (exsmtp01.exserver.dk [195.69.129.177]) by exproxy-2.exserver.dk (Postfix) with ESMTP id C2CEE106ED0 for <support_email.com@exfwd01.scannet.dk>; Mon, 9 Mar 2009 12:13:26 +0100 (CET)>, <Received: from exsmtp02.exserver.dk ([10.10.10.32]) by exsmtp01.exserver.dk with Microsoft SMTPSVC(6.0.3790.1830); Mon, 9 Mar 2009 12:22:19 +0100>, <Received: from front08.exserver.dk ([195.69.129.93]) by exsmtp02.exserver.dk with Microsoft SMTPSVC(6.0.3790.1830); Mon, 9 Mar 2009 12:22:19 +0100>, <Received: from localhost (front08.exserver.dk [127.0.0.1]) by front08.exserver.dk (Postfix) with ESMTP id F1B2BC4028 for <support@email.com>; Mon, 9 Mar 2009 12:46:22 +0100 (CET)>, <Received: from front08.exserver.dk ([127.0.0.1]) by localhost (front08.exserver.dk [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id mrYGo4G2pt13 for <support@email.com>; Mon, 9 Mar 2009 12:46:16 +0100 (CET)>, <Received: from mail.gameseek.co.uk (78.109.164.42.srvlist.ukfast.net [78.109.164.42]) by front08.exserver.dk (Postfix) with ESMTP id 99022C4021 for <support@email.com>; Mon, 9 Mar 2009 12:46:16 +0100 (CET)>, <Received: by mail.gameseek.co.uk (Postfix, from userid 48) id D321218DD2D; Mon, 9 Mar 2009 11:22:55 +0000 (GMT)>, <Date: Mon, 09 Mar 2009 11:22:55 +0000>, <From: myorder@gameseek.co.uk>, <Reply-To: myorder@gameseek.co.uk>, <To: 
support@email.com>, <Message-ID: <20090309112255.D321218DD2D@mail.gameseek.co.uk>>, <Subject: Gameseek Order Refunded: Gh68y1235386413>, <Delivered-To: my@email.com>, <Received-SPF: neutral (google.com: 195.69.129.163 is neither permitted nor denied by best guess record for domain of apache@mail.gameseek.co.uk) client-ip=195.69.129.163;>, <Authentication-Results: mx.google.com; spf=neutral (google.com: 195.69.129.163 is neither permitted nor denied by best guess record for domain of apache@mail.gameseek.co.uk) smtp.mail=apache@mail.gameseek.co.uk>, <X-Exserver-To: support_email.com@exfwd01.scannet.dk>, <X-Virus-Scanned: amavisd-new at exserver.dk>, <X-OriginalArrivalTime: 09 Mar 2009 11:22:19.0838 (UTC) FILETIME=[4F6005E0:01C9A0A9]>, <X-ScanNet-Forward: TTL=5>>

I now wish to give it a proper encoding:

unless message.multipart?
  charset = message.charset # => "UTF-8"
  if charset != nil
    body = message.body.decoded.force_encoding(charset).encode("UTF-8") # => "\n\nHello you,\n\nYour order or part of it has been refunded by Gameseek. The refund will be present on the same payment method you used when purchasing. If no other items are due to be posted to you the postage charge will also be refunded.\n\nPlease allow upto four working days for this refund to process.\n\nIf you have not contacted us about this order then it is most likely you are being refunded for an item we cannot currently get hold of.\n\nWe do apologise if this is the case, we would rather refund customers rather than having them wait weeks and weeks for an item.\n\nIf you have contacted us about this order then you will know why you are being refunded.\nMay we apologise if we have not met your requirements on this occassion.\n\nYour Order: Product | Category | Quantity | Cost\n---------------------------------------------------\nDragon Ball Z - Supersonic Warriors 2 | NintendoDS | 1 | \xA326.97\n\n\nFor all order enquires please contact myorder@gameseek.co.uk\n\nThank you for using Gameseek.\n"
  end
end

body = body.split(/Sent from my iPhone/)[0]

The last line raises the following error:

invalid byte sequence in UTF-8

Any idea how to fix this?

Cjoerg

1 Answer


The text contains the byte \xA3, which is not a valid UTF-8 sequence on its own. In Latin-1 (ISO-8859-1), that byte represents a pound sign:

"\xA3".force_encoding('ISO-8859-1').encode('UTF-8')
#=> "£"

The quick fix is to run String#scrub on body. With an empty replacement string, however, that removes the invalid bytes instead of converting them:

"\xA326.97".scrub('')
#=> "26.97"

However, to solve the "real" problem you should look earlier in the pipeline. The supplied charset (UTF-8) seems to be wrong: the message body is apparently encoded in Latin-1. The problem may well be on the sender's side.
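To make the mismatch concrete, here is a minimal sketch using a short excerpt of the body from the question. The variable names are illustrative only. It relies on the fact that `force_encoding` merely relabels the bytes, and that `encode("UTF-8")` on a string already tagged as UTF-8 does not validate it, which is presumably why the error only shows up later, in `split`.

```ruby
# Bytes as they might arrive from the mail body (binary, no encoding claimed yet).
raw = "1 | \xA326.97\n".b

# Trusting the declared charset only tags the bytes; it does not check them.
as_utf8 = raw.dup.force_encoding("UTF-8")
utf8_ok = as_utf8.valid_encoding? # false: \xA3 is not a complete UTF-8 sequence

# Reading the same bytes as Latin-1 and transcoding yields valid UTF-8.
fixed = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
fixed_ok = fixed.valid_encoding?  # true
```

Checking `valid_encoding?` right after `force_encoding` is a cheap way to detect a wrong charset before any string operation raises.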

Patrick Oscity
  • Very interesting that the issue may be on the sender's end, but just as Gmail, Outlook, etc. need to deal with flawed input, so must I. – Cjoerg Dec 12 '14 at 18:58
  • I have tried to do as in [this answer](http://stackoverflow.com/a/18454435/1413388) and clean the string with `string.encode!('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')`, however this removes way too much, for example Scandinavian special characters. – Cjoerg Dec 12 '14 at 19:00
  • It should yield the same result as encode + scrub, I think. As I said, it is definitely not the optimal solution. Encoding issues tend to end up very messy, especially when you have heterogeneous data sources, such as mail. If you get a message that is encoded incorrectly, you can only guess. You may even receive a message that was encoded in Latin-1 but uses only ASCII characters, and you will not notice that you used the wrong encoding. – Patrick Oscity Dec 12 '14 at 19:07
  • Hehe, right I also got to that conclusion right this moment :-) – Cjoerg Dec 12 '14 at 19:25
  • You could write a small function that forces some common encodings and picks the first one where `valid_encoding?` is true, then falls back to `scrub`. – Patrick Oscity Dec 12 '14 at 19:33
  • Interesting. I will find a list of common encodings and then do exactly that. – Cjoerg Dec 12 '14 at 19:44
  • Got the idea here http://www.mobalean.com/blog/2011/09/02/guessing-a-strings-encoding-under-ruby-1-9 – Patrick Oscity Dec 12 '14 at 19:49
  • Thanks for the link. I did as suggested and it seems to work! – Cjoerg Dec 12 '14 at 22:27
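
The approach suggested in the comments could be sketched roughly as follows. This is not the code the asker ended up with. The method name `to_utf8` and the candidate list are assumptions made for illustration, and `String#scrub` requires Ruby 2.1 or later.

```ruby
# Hypothetical candidate list; order matters because the first match wins.
CANDIDATE_ENCODINGS = %w[UTF-8 Windows-1252 ISO-8859-1].freeze

# Returns a UTF-8 copy of +raw+, trying the declared charset first and then
# each candidate until one both validates and transcodes cleanly.
def to_utf8(raw, declared_charset = nil, candidates = CANDIDATE_ENCODINGS)
  ([declared_charset] + candidates).compact.uniq.each do |name|
    begin
      tagged = raw.dup.force_encoding(name)
      return tagged.encode("UTF-8") if tagged.valid_encoding?
    rescue ArgumentError, EncodingError
      # Unknown charset name, or a byte with no mapping in this encoding.
      next
    end
  end
  # Reached only if no candidate accepted the bytes: drop the invalid ones.
  raw.dup.force_encoding("UTF-8").scrub("")
end

price = to_utf8("\xA326.97".b, "UTF-8")
```

Single-byte encodings such as ISO-8859-1 accept any byte sequence, so they only make sense at the end of the list, and a wrongly declared single-byte charset will still win over the guesses. As noted in the comments, this remains a guess rather than a reliable detection.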