How can I decode a single character from a vector of octets in Common Lisp?

I want something like:

(decode-character vector :start i :encoding :utf-8)

or more specifically:

(decode-character #(195 164 195 173 99 195 176) :start 0)
=> #\LATIN_SMALL_LETTER_A_WITH_DIAERESIS

which would return the UTF-8 encoded character that starts at position i in vector.

I can't figure out how to do that using either babel or flexi-streams.

Thayne

2 Answers

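(defun decode-character (vector &rest args)
  ;; ARGS are passed straight to babel:octets-to-string, so :start,
  ;; :end and :encoding all work; the first decoded character is returned.
  (char (apply #'babel:octets-to-string
               (coerce vector '(vector (unsigned-byte 8))) args)
        0))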
huaiyuan
  • I would not `coerce` but `check-type` before. – Svante Oct 18 '15 at 21:28
  • The `encoding` keyword parameter seems relevant for the question. – Svante Oct 18 '15 at 21:34
  • This would work if it used the start and end keyword arguments as in @coredump's answer. I was hoping for something that didn't unnecessarily create a string, but this works at least. – Thayne Oct 19 '15 at 02:05
  • @Thayne If you need to work at a lower level, you can also look at how octets-to-string is implemented. A custom `read-character` function that gets the next character from a stream may not be too hard to implement. – coredump Oct 20 '15 at 09:11

This may not be exactly what you are looking for (I'll gladly update if I can). I did not look at Babel, but the approach should generalize to other encodings. I'll stick with trivial-utf-8 here:

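(defun decode-utf-8-char (octet-vector &key (start 0))
  ;; Decode a window of at most 4 octets, clamped to the end of the
  ;; vector, and keep only the first character.
  (char (trivial-utf-8:utf-8-bytes-to-string
          octet-vector
          :start start
          :end (min (length octet-vector) (+ start 4)))
        0))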

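This gives the result you want with your example vector. It works because a UTF-8 character is at most 4 bytes long, so a 4-byte window beginning at start always covers the first character when that many bytes are available. The call to char grabs the first character in case more than one was decoded. One caveat: the window can end in the middle of a later character, and a strict decoder may signal an error in that case instead of returning the first character.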
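
To decode without building an intermediate string (the wish expressed in the comments on the other answer), the sequence length can be read off the lead byte and the code point assembled directly. The following is a minimal sketch in plain Common Lisp, assuming an implementation whose character codes are Unicode code points (SBCL and CCL, for example). The name decode-utf-8-char-at is hypothetical and not part of Babel or trivial-utf-8, and the sketch does not reject overlong encodings or surrogate code points.

```lisp
(defun decode-utf-8-char-at (octets &key (start 0))
  "Decode the single UTF-8 character beginning at START in OCTETS.
Returns the character and the index just past it."
  (let* ((lead (aref octets start))
         ;; The lead byte fixes the sequence length (1 to 4 octets).
         (size (cond ((< lead #x80) 1)
                     ((<= #xC0 lead #xDF) 2)
                     ((<= #xE0 lead #xEF) 3)
                     ((<= #xF0 lead #xF7) 4)
                     (t (error "Invalid UTF-8 lead byte ~D at index ~D."
                               lead start))))
         (end (+ start size))
         ;; A multi-byte lead byte carries (7 - size) payload bits.
         (code (if (= size 1)
                   lead
                   (ldb (byte (- 7 size) 0) lead))))
    (when (> end (length octets))
      (error "Truncated UTF-8 sequence at index ~D." start))
    ;; Each continuation byte looks like 10xxxxxx and adds six bits.
    (loop for i from (1+ start) below end
          do (let ((octet (aref octets i)))
               (unless (= (ldb (byte 2 6) octet) #b10)
                 (error "Invalid continuation byte ~D at index ~D."
                        octet i))
               (setf code (logior (ash code 6)
                                  (ldb (byte 6 0) octet)))))
    (values (code-char code) end)))
```

With the example vector, (decode-utf-8-char-at #(195 164 195 173 99 195 176) :start 0) yields the character with code 228 (LATIN SMALL LETTER A WITH DIAERESIS) and 2 as the second value. Passing that second value back as :start steps through the vector one character at a time.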

coredump