
I am playing with the Unix hexdump utility. My input file is UTF-8 encoded and contains a single character, ñ, whose UTF-8 encoding is the two bytes C3 B1.

hexdump test.txt
0000000 b1c3
0000002

Huh? This shows B1 C3 - the reverse of what I expected! Can someone explain?

To get the expected output, I do:

hexdump -C test.txt
00000000  c3 b1                                             |..|
00000002

I thought I understood encodings.

zedoo

2 Answers


This is because hexdump defaults to 16-bit words and you are running on a little-endian architecture. The byte sequence c3 b1 is thus read as the 16-bit word b1c3, low byte first. The -C option forces hexdump to display individual bytes instead of words.
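The word-assembly step described above can be reproduced in a few lines of Python; the byte string below is just the UTF-8 encoding of ñ from the question:

```python
import struct

data = b"\xc3\xb1"  # the two UTF-8 bytes of the character ñ

# "<H" reads the two bytes as one unsigned 16-bit little-endian word:
# the first byte on disk (c3) becomes the low-order byte of the value.
(little,) = struct.unpack("<H", data)
print(f"{little:04x}")  # b1c3 - the word plain `hexdump` prints

# ">H" (big-endian) keeps the printed digits in on-disk byte order.
(big,) = struct.unpack(">H", data)
print(f"{big:04x}")  # c3b1 - matches the byte-wise `hexdump -C` view
```
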

Marcelo Cantos
  • I was thinking it must have something to do with endianness. – zedoo May 17 '10 at 08:18
  • But why does hexdump default to this confusing output format? Is there a historical reason? – accuya Mar 01 '12 at 12:05
  • What's confusing is the propensity of humans to encode numbers in big-endian order. Little-endian is more logical, which is why it's used on many CPU architectures, including x86, in spite of the awkwardness. – Marcelo Cantos Mar 02 '12 at 02:32
  • Actually, big-endian and little-endian each have their strengths and weaknesses. Neither is "more logical" in an absolute sense. – Marko Topolnik Apr 15 '16 at 06:40
  • @MarceloCantos, what's confusing is that it assumes 16-bit little-endian words. What is the logic in choosing 16-bit words, or any other word length? IMO it makes more sense to default to a big-endian representation, which would look the same regardless of word length and thus be much less confusing in this use case. – akostadinov Dec 29 '16 at 07:51
  • Purely conjecture, but the historical reason is almost certainly that hexdump was initially implemented on a little-endian machine that used 16-bit words, and it was a perfectly reasonable default. – William Pursell Jun 01 '17 at 13:20

I found two ways to avoid this:

hexdump -C file

or

od -tx1 < file

I think it is a poor default that hexdump treats files as 16-bit little-endian words. Very confusing, IMO.
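A quick check of the od variant, assuming a scratch file named test.txt holding the two UTF-8 bytes of ñ (the -An flag, which suppresses the leading offset column, is standard POSIX od):

```shell
# Write the two UTF-8 bytes of ñ to a scratch file.
printf '\xc3\xb1' > test.txt

# One byte per column, printed in on-disk order regardless of host
# endianness, so c3 appears before b1.
od -tx1 test.txt

# -An drops the offset column, leaving only the hex bytes.
od -An -tx1 test.txt
```
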

akostadinov