
If I pass bytes that are invalid UTF-8 to Node and try to read them from `process.argv`, I get the replacement character (U+FFFD):

$ node -e "console.log(process.argv[1].codePointAt(0).toString(16))" $'\x7f'
7f
$ node -e "console.log(process.argv[1].codePointAt(0).toString(16))" $'\x80'
fffd

whereas command-line tools implemented in C get the raw bytes:

$ echo -n $'\x7f' | xxd
00000000: 7f                                       .
$ echo -n $'\x80' | xxd
00000000: 80                                       .

Is there a way to get the raw bytes of the arguments that were passed in to my Node program without having Node try to decode them as a UTF-8 string?
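To show why the original bytes can't be recovered from the decoded string, here is a minimal sketch that reproduces the same loss with `Buffer` instead of `process.argv`. It assumes that Node's argument decoding behaves like ordinary UTF-8 decoding with replacement, which matches the output above but is not something I have confirmed in Node's source.

```javascript
// Two different invalid byte sequences decode to the same string,
// so the original bytes cannot be recovered after decoding.
const a = Buffer.from([0x80]);
const b = Buffer.from([0x81]);

const decodedA = a.toString('utf8');
const decodedB = b.toString('utf8');

console.log(decodedA.codePointAt(0).toString(16)); // fffd
console.log(decodedA === decodedB);                // true

// Re-encoding yields the UTF-8 encoding of U+FFFD, not the original byte.
const roundTripped = Buffer.from(decodedA, 'utf8');
console.log(roundTripped.toString('hex'));         // efbfbd
```

Since `0x80` and `0x81` both become U+FFFD, no post-processing of the string in `process.argv` can tell them apart.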

Nick Parsons
Boris Verkhovskiy
  • I'm quite certain it's not possible but wanted to ask to make sure. – Boris Verkhovskiy Nov 17 '22 at 10:55
  • This is possible in Python because it uses its `surrogateescape` error handler, which maps each undecodable byte `0xXX` to the lone surrogate `U+DCXX`. These surrogates have no meaning in UTF-8, which lets you recover the initial data: https://stackoverflow.com/questions/27185295 This can't be done in Node since it uses UTF-16 and hence those surrogates have a meaning. – Boris Verkhovskiy Nov 18 '22 at 07:05
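For reference, the byte-to-surrogate mapping that `surrogateescape` uses can be sketched in a few lines. This only illustrates the mapping itself (bytes `0x80` to `0xFF` become `U+DC80` to `U+DCFF`); it does not change how Node decodes `process.argv`, and `escapeByte` and `unescapeUnit` are hypothetical helper names, not part of any Node API.

```javascript
// Illustration of the surrogateescape byte <-> code unit mapping.
// escapeByte and unescapeUnit are hypothetical helpers, not Node APIs.

// Map an undecodable byte (0x80..0xFF) to a lone low surrogate.
function escapeByte(byte) {
  if (!Number.isInteger(byte) || byte < 0x80 || byte > 0xff) {
    throw new RangeError('expected a byte in the range 0x80..0xFF');
  }
  return String.fromCharCode(0xdc00 + byte);
}

// Map a lone low surrogate in U+DC80..U+DCFF back to the original byte.
function unescapeUnit(unit) {
  const code = unit.charCodeAt(0);
  if (code < 0xdc80 || code > 0xdcff) {
    throw new RangeError('expected a code unit in the range U+DC80..U+DCFF');
  }
  return code - 0xdc00;
}

const escaped = escapeByte(0x80);
console.log(escaped.charCodeAt(0).toString(16)); // dc80
console.log(unescapeUnit(escaped).toString(16)); // 80
```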

0 Answers