Given a proto file:

syntax = "proto3";
package hello;

message TopGreeting {
    NestedGreeting greeting = 1;
}

message NestedGreeting {
    Greeting greeting = 1;
}

message Greeting {
    string message = 1;
}

and the code:

public class Main {
    public static void main(String[] args) {
        System.out.printf("From top: %s%n", newGreeting("오늘은 무슨 요일입니까?"));
        System.out.printf("Directly: %s%n", "오늘은 무슨 요일입니까?");
        System.out.printf("ByteString: %s", newGreeting("오늘은 무슨 요일입니까?").toByteString().toStringUtf8());
    }

    private static Hello.TopGreeting newGreeting(String message) {
        Hello.Greeting greeting = Hello.Greeting.newBuilder()
                .setMessage(message)
                .build();
        Hello.NestedGreeting nestedGreeting = Hello.NestedGreeting.newBuilder()
                .setGreeting(greeting)
                .build();
        return Hello.TopGreeting.newBuilder()
                .setGreeting(nestedGreeting)
                .build();
    }
}

Output

From top: greeting {
  greeting {
    message: "\354\230\244\353\212\230\354\235\200 \353\254\264\354\212\250 \354\232\224\354\235\274\354\236\205\353\213\210\352\271\214?"
  }
}

Directly: 오늘은 무슨 요일입니까?

ByteString: 
%
#
!오늘은 무슨 요일입니까?

How do I print the message in a human-readable way? As you can see, converting to ByteString prints the UTF-8 characters correctly, but it also prints stray characters such as % and #.

Abhijit Sarkar
  • Is it possible that the source code or those string literals are in UTF16 or something other than UTF8? The thing that's got my attention is that it has output things like "\354\230\244", but then the spaces are intact. Some of those numbers are >255, hence my wondering if it's trying to output 16 bit values. If it were dumping UTF8 as byte values, I'd expect them to be <255. – bazza Jul 18 '20 at 09:37
  • Hello again, I found in this answer https://stackoverflow.com/a/2164888/2147218 that Java strings are UTF16, which may have something to do with how the strings are appearing in the debug output. If the GPB class were expecting its buffer to contain UTF8 encoded text, but actually it contained UTF16 encoded text, then it would print out strangely; the two encodings are not compatible. I'm wondering if you can use something like this answer https://stackoverflow.com/a/5729828/2147218 to convert your string literal to UTF8 before passing it to `newGreeting`? – bazza Jul 18 '20 at 13:04
  • @bazza see my answer. Almost always, the truth is in the source code. – Abhijit Sarkar Jul 18 '20 at 22:47

2 Answers

Answering my own question: I solved this issue by digging through the Protobuf source code.

// needs: import com.google.protobuf.TextFormat;
System.out.println(TextFormat.printer().escapingNonAscii(false).printToString(greeting));

Output:

greeting {
  greeting {
    message: "오늘은 무슨 요일입니까?"
  }
}

toString() uses the same mechanism, but with escapingNonAscii(true), which is the default when the call is omitted.
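
As a rough illustration of what that default escaping does to the Korean text, here is a JDK-only sketch that mimics it. `OctalEscapeSketch` and `escapeNonAscii` are hypothetical names, not protobuf APIs, and the sketch deliberately ignores protobuf's special handling of quotes, backslashes and control characters. It only shows how each UTF-8 byte above 0x7F turns into a three-digit octal escape. The sample character is written as a `\u` escape so the source file stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of protobuf: a simplified imitation of the
// octal escaping seen in the default toString() output.
final class OctalEscapeSketch {

    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int unsigned = b & 0xFF;
            if (unsigned < 0x80) {
                // ASCII bytes pass through unchanged (simplification: the real
                // printer also escapes quotes, backslashes and control bytes).
                out.append((char) unsigned);
            } else {
                // Every other byte becomes a backslash plus three octal digits.
                out.append('\\')
                   .append((char) ('0' + ((unsigned >> 6) & 3)))
                   .append((char) ('0' + ((unsigned >> 3) & 7)))
                   .append((char) ('0' + (unsigned & 7)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+C624 is the first syllable of the greeting in the question.
        System.out.println(escapeNonAscii("\uC624 ok?"));
    }
}
```

U+C624 is encoded in UTF-8 as the bytes EC 98 A4, which are 354 230 244 in octal, the same three values that open the escaped string in the question's output. That is also why the spaces and the question mark survive untouched: they are single ASCII bytes.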

Also see this answer for how to convert octal escape sequences back to UTF-8 characters in case you have access only to the logs, not to the source code.
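
For the logs-only case, a minimal JDK-only sketch of the reverse direction might look like the following. `OctalUnescapeSketch` and `unescapeOctal` are hypothetical names chosen for illustration, and the sketch assumes the logged text contains only three-digit `\ooo` escapes. Other escapes that protobuf's text format can emit (such as `\n`, `\"` or `\\`) are not handled here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for reading logs; not part of protobuf.
final class OctalUnescapeSketch {

    // True when the three characters starting at 'from' are all octal digits.
    private static boolean isOctalTriple(String s, int from) {
        if (from + 3 > s.length()) {
            return false;
        }
        for (int i = from; i < from + 3; i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '7') {
                return false;
            }
        }
        return true;
    }

    static String unescapeOctal(String logged) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < logged.length()) {
            if (logged.charAt(i) == '\\' && isOctalTriple(logged, i + 1)) {
                // A backslash followed by three octal digits is one raw byte.
                bytes.write(Integer.parseInt(logged.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                // Anything else is copied through as its own UTF-8 bytes.
                int codePoint = logged.codePointAt(i);
                byte[] raw = new String(Character.toChars(codePoint))
                        .getBytes(StandardCharsets.UTF_8);
                bytes.write(raw, 0, raw.length);
                i += Character.charCount(codePoint);
            }
        }
        // Reinterpret the collected bytes as UTF-8 text.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The first three escapes from the question's output, then plain ASCII.
        System.out.println(unescapeOctal("\\354\\230\\244 ok?"));
    }
}
```

The bytes are collected first and decoded as UTF-8 only at the end, because a single Korean character spans three escapes and cannot be decoded one escape at a time.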

Abhijit Sarkar
  • Good find :-) I should have spotted the octal... I've found a reference here: https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/TextFormat.Printer.html#escapingNonAscii-boolean- The default toString() certainly seems to be pretty rubbish. The string encoding in the object is supposed to be UTF8, so one would think that it would at least try not to print it as 7-bit ASCII. I presume then it's relying on the stdout understanding UTF8 - which clearly yours does - but it's not guaranteed universally. I'm wondering if there's different behaviour on Windows and Linux. – bazza Jul 19 '20 at 08:32

The protobuf binary format isn't human readable and you shouldn't attempt to make it so. There is a JSON variant if you need one, but frankly it would be better to log the interpreted data, not the payloads.
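
To make concrete why the ByteString output in the question shows stray % and # characters, here is a JDK-only sketch that hand-builds the length-delimited framing for three nested messages that each use field number 1. `WireFramingSketch` and its methods are hypothetical names, not protobuf APIs, and the sketch assumes every payload is shorter than 128 bytes so each length fits in a single byte. The message is a 33-byte ASCII stand-in, because the greeting in the question is 33 bytes long in UTF-8 (ten 3-byte Hangul syllables, two spaces and a question mark).

```java
import java.util.Arrays;

// Hypothetical illustration of protobuf's length-delimited framing;
// real code should rely on the generated classes instead.
final class WireFramingSketch {

    // Tag byte for field number 1 with wire type 2 (length-delimited): (1 << 3) | 2.
    static final byte FIELD_1_LENGTH_DELIMITED = 0x0A;

    // Prefixes the payload with the tag byte and a one-byte length.
    static byte[] wrapAsField1(byte[] payload) {
        if (payload.length > 127) {
            throw new IllegalArgumentException("sketch only handles single-byte lengths");
        }
        byte[] framed = new byte[payload.length + 2];
        framed[0] = FIELD_1_LENGTH_DELIMITED;
        framed[1] = (byte) payload.length;
        System.arraycopy(payload, 0, framed, 2, payload.length);
        return framed;
    }

    // Greeting -> NestedGreeting -> TopGreeting, each stored as field 1.
    static byte[] frameThreeLevels(byte[] utf8Message) {
        byte[] greeting = wrapAsField1(utf8Message);
        byte[] nested = wrapAsField1(greeting);
        return wrapAsField1(nested);
    }

    public static void main(String[] args) {
        byte[] message = new byte[33];
        Arrays.fill(message, (byte) 'a');
        byte[] wire = frameThreeLevels(message);
        // The three length bytes, viewed as characters: prints "% # !"
        System.out.println((char) wire[1] + " " + (char) wire[3] + " " + (char) wire[5]);
    }
}
```

The lengths 37, 35 and 33 happen to be the ASCII codes for %, # and !, and the tag byte 0x0A is a line feed, which is why each of those symbols appears on its own line before the readable text.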

Marc Gravell
  • I disagree. Almost always, one part of a response doesn’t stand on its own, and its interpretation depends on the other parts. Seeing the whole message is crucial for debugging, and works as expected with ASCII charset. What boggles my mind is that Google went out of their way to obscure what’s printed. – Abhijit Sarkar Jul 18 '20 at 07:32
  • @AbhijitSarkar, you have misunderstood the purpose of GPB. Google designed it as a binary serialiser specifically to save storage space. Text serialisations, which can be clumsily read as plain text, take up a lot more room and take longer to send via a network connection. – bazza Jul 18 '20 at 09:14
  • @bazza I think you misunderstood my point. No one is stopping Google from doing what's best on the wire; I'm talking about _printing messages for debugging_, not transmitting them anywhere. Debugging, still, is done by programmers, who are usually human. – Abhijit Sarkar Jul 18 '20 at 09:16
  • @AbhijitSarkar ah I see, I'm sorry. Hang on a mo and I'll do some digging. My first instinct is that GPB has specific ways of representing strings, and that it can get lost if a different character encoding gets put into it. – bazza Jul 18 '20 at 09:19
  • @Abhijit so log the *contents*, not the serialization payload, which is what you seem to be doing right now. The serialization payload is not *intended* to be readable. – Marc Gravell Jul 19 '20 at 00:45
  • @MarcGravell There is no ambiguity in my question or sample code. – Abhijit Sarkar Jul 19 '20 at 02:08