3

What is the encoding of Nginx's access.log? I'm trying to iterate through the file but my script raises an exception "invalid byte sequence in UTF-8" when it sees requests to my server that have Chinese/Thai characters in them.

Community
  • 1
  • 1
Tom
  • 9,275
  • 25
  • 89
  • 147

1 Answers1

1

An HTTP protocol request is mostly ASCII, with data fields allowed to be any octets. See Which encoding is used by the HTTP protocol? .

Nginx's error and access logs will reflect that.

My experience with 450,000 records from a small North American site revealed that all but one record decoded to ASCII without error. That record contained 4 consecutive bytes (b'\xb8E\x8c\xde') that were invalid UTF-8 but were valid big5hkscs (Python's codec for Traditional Chinese) that produced two glyphs.

See How can I programmatically find the list of codecs known to Python? to get a list of codec names for a brute force attack on non-ASCII bits.

Decoding the binary log records to ASCII, with '?' replacements for errors, was good enough for my needs.

Community
  • 1
  • 1
jkellydresser
  • 51
  • 1
  • 3