I am trying to port some code from Python to Ruby, and I am having difficulty with one of the functions, which encodes a UTF-8 string to JSON.

I have stripped down the code to what I believe is my problem.

I would like Ruby to produce exactly the same output as Python.

The python code:

#!/usr/bin/env python
# encoding: utf-8

import json
import hashlib

text = "ÀÈG"
js = json.dumps( { 'data': text } )

print 'Python:'
print js
print hashlib.sha256(js).hexdigest()

The ruby code:

#!/usr/bin/env ruby
require 'json'
require 'digest'

text = "ÀÈG"
obj = {'data': text}
# js = obj.to_json # not using this, in order to get the space below
js = %Q[{"data": "#{text}"}]

puts 'Ruby:'
puts js
puts Digest::SHA256.hexdigest js

When I run both, this is the output:

$ ./test.rb && ./test.py
Ruby:
{"data": "ÀÈG"}
6cbe518180308038557d28ecbd53af66681afc59aacfbd23198397d22669170e
Python:
{"data": "\u00c0\u00c8G"}
a6366cbd6750dc25ceba65dce8fe01f283b52ad189f2b54ba1bfb39c7a0b96d3

What do I need to change in the ruby code to make its output identical to the python output (at least the final hash)?

Notes:

  • I have tried things from this SO question (and others) without success.
  • The code above produces identical results when the input contains only English (ASCII) characters, so I know the hashing itself is the same.
DannyB
  • Out of curiosity, why? JSON is typically transmitted with content-type `application/json; charset=utf-8`, so it may be fine to simply include these characters in your JSON. – mwp Aug 31 '16 at 06:30
  • Good question. I am trying to port a longer python code to ruby. It involves some AWS4 signatures (not fun...). My port works when using English characters as input, and I am getting errors from AWS4 when using non English characters. The above question is what I pinpointed as the first difference between the two implementations. – DannyB Aug 31 '16 at 06:37
  • OK. I've had to do similar things in the past for the Facebook Graph API, so I understand. Ruby and Python just handle these things so differently. Like, why is the "c" in `\u00c8G` in the Python example lowercase, but the "G" is capitalized? I can show you how to convert it to these escape sequences in Ruby, but Ruby capitalizes all the alphabetical characters, making this very tricky to get right. – mwp Aug 31 '16 at 06:39
  • Oh, herp, that's the G from the input data. Right. Working on a solution for you. – mwp Aug 31 '16 at 06:44

1 Answer

Surely someone will come along with a more elegant (or at least a more efficient and robust) solution, but here's one for the time being:

#!/usr/bin/env ruby

require 'json'
require 'digest'

text = 'ÀÈG'
  .encode('UTF-16')                          # convert UTF-8 characters to UTF-16
  .inspect                                   # escape UTF-16 characters and convert back to UTF-8
  .sub(/^"\\u[Ff][Ee][Ff][Ff](.*?)"$/, '\1') # remove outer quotes and BOM
  .gsub(/\\u\w{4}/, &:downcase)              # downcase hex digits (non-bang: downcase! returns nil if unchanged)

js = { data: text }                          # wrap in containing data structure
  .to_json(:space=>' ')                      # convert to JSON with spaces after colons
  .gsub(/\\\\u(?=\w{4})/, '\\u')             # remove extra backslashes

puts 'Ruby:', js, Digest::SHA256.hexdigest(js)

Output:

$ ./test.rb 
Ruby:
{"data": "\u00c0\u00c8G"}
a6366cbd6750dc25ceba65dce8fe01f283b52ad189f2b54ba1bfb39c7a0b96d3
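
A possibly simpler route, sketched here under the assumption that the `json` library bundled with Ruby is in use: `JSON.generate` accepts an `ascii_only` option that escapes non-ASCII characters as `\uXXXX`, and a `space` option whose value is placed after each colon. This is not part of the answer above. Note also that `space` does not add the space that Python's default separators put after commas, so objects with several keys would still need extra handling.

```ruby
#!/usr/bin/env ruby
require 'json'
require 'digest'

text = 'ÀÈG'

# ascii_only: true asks the generator to escape every non-ASCII character;
# space: ' ' inserts a single space after each colon, as json.dumps does.
js = JSON.generate({ 'data' => text }, ascii_only: true, space: ' ')

puts 'Ruby:', js, Digest::SHA256.hexdigest(js)
```

For the single-key object in the question this should yield the same string as the Python script. It is still worth comparing the two hashes before relying on it, since escape formatting could differ between json versions.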
mwp
  • Wow, that's a mouthful... :) Thanks for the effort and the well-documented answer. – DannyB Aug 31 '16 at 07:28
  • You're quite welcome! As I mentioned above, I've wrestled with this before so I had a good idea where to start from. Good luck! – mwp Aug 31 '16 at 07:30
  • If nothing else, this answer makes me think the problem lies elsewhere. I will of course accept the answer once I have made some more tests, and once the chance of other illuminating answers fades. – DannyB Aug 31 '16 at 07:58
  • 1
    @DannyB The problem is that the existing code that you're trying to port relies on very specific ways that Python 1. encodes data as UTF-16, 2. represents UTF escape sequences, and 3. formats encoded JSON. Ruby does it differently—not wrong, just differently—and the checksums don't match. All or most of this could be avoided by just working with UTF-8 data, which is perfectly valid JSON, but unfortunately you're dealing with an external system with some "quirks" that you must emulate. – mwp Aug 31 '16 at 19:07
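
The mismatch described in the last comment can be seen by comparing the two strings byte by byte. The sketch below is illustrative only; the variable names `raw` and `escaped` are not from the original code.

```ruby
require 'digest'

raw     = %Q[{"data": "ÀÈG"}]          # what the question's Ruby script hashes: literal UTF-8 characters
escaped = '{"data": "\u00c0\u00c8G"}'  # what Python emits; single quotes keep \u as plain ASCII text

# The escaped form is pure ASCII and longer, because each \uXXXX
# sequence takes six bytes where the UTF-8 character takes two.
puts raw.ascii_only?, escaped.ascii_only?
puts raw.bytesize, escaped.bytesize

# SHA-256 is computed over bytes, so different bytes give different digests.
puts Digest::SHA256.hexdigest(raw) == Digest::SHA256.hexdigest(escaped)
```

Both strings decode to the same JSON value, which is why a parser treats them as equal while a signature over the raw bytes does not.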