5

Is it possible to read one UTF-8 character from a file?

file:read(1) returns weird characters when I print them.

function firstLetter(str)
  return str:match("[%z\1-\127\194-\244][\128-\191]*")
end

This function returns the first UTF-8 character of the string str. I need to read one UTF-8 character the same way, but from an input file (I don't want to load the whole file into memory via file:read("*all")).
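A minimal sketch of why file:read(1) looks wrong: it returns a single byte, while ľ (U+013E) takes two bytes in UTF-8. The helper name `read_one_byte`, the temporary file, and the escaped byte literals are assumptions made only for this illustration.

```lua
-- Hypothetical helper: write `text` to a temporary file and read one byte back.
function read_one_byte(text)
  local f = assert(io.tmpfile())
  f:write(text)
  f:seek("set", 0)
  local b = f:read(1)
  f:close()
  return b
end

-- "ľ" is U+013E, which UTF-8 encodes as the two bytes 196 and 190.
local first = read_one_byte("\196\190a")
print(first:byte(1, -1))  -- numeric value of the lone byte that was read
print(#first)             -- string length is counted in bytes, not characters
```

Printing that lone lead byte is what produces the odd glyph, because on its own it is not a complete UTF-8 sequence.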

This question is pretty similar to this post: Extract the first letter of a UTF-8 string with Lua

Community
Hrablicky
  • One pretty straightforward but for sure not very popular way is to really "parse the bytes (1..6) and convert them to a UTF-32 value". Using UTF-32 can make stuff easier in some cases, depending on what you are going to do. – BitTickler Apr 24 '15 at 19:55
  • Do what that function does while manually reading a character at a time? Though that will leave you having read one more character than you needed, so you'll need to rewind. – Etan Reisner Apr 24 '15 at 20:28
  • I'm going to create a typography corrector (which can also read Czech characters), so I'm going to read the input file, find the mistakes and correct them. But it's impossible to work with characters that are unknown to Lua. Original text: ľúbozvučně řeřicha čučoridka ľaľia. Text as read by Lua (in ZeroBrane Studio): [link](http://i.imgur.com/PcorbzP.png). When I compare the first char from both, they don't match. – Hrablicky Apr 24 '15 at 20:31

3 Answers

3
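function read_utf8_char(file)
  local c1 = file:read(1)
  if not c1 then return nil end  -- end of file
  -- Count the leading 1 bits of the first byte. For well-formed UTF-8,
  -- that count minus one is the number of continuation bytes.
  local ctr, c = -1, math.max(c1:byte(), 128)
  repeat
    ctr = ctr + 1
    c = (c - 128)*2
  until c < 128
  -- file:read(0) returns nil at end of file, so skip it for single bytes
  if ctr == 0 then return c1 end
  return c1..(file:read(ctr) or "")
end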
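The loop above can be read as "count the leading 1 bits of the first byte, minus one". The sketch below isolates that arithmetic in a hypothetical helper, `continuation_count` (a name introduced here, not part of the answer), so it can be tried on sample lead bytes. Clamping with math.max makes every ASCII byte behave like 128, so the loop stops after one pass with a count of 0.

```lua
-- Hypothetical helper: number of continuation bytes implied by a lead byte,
-- computed the same way as the loop in read_utf8_char.
function continuation_count(byte)
  local ctr, c = -1, math.max(byte, 128)
  repeat
    ctr = ctr + 1
    c = (c - 128) * 2  -- drop the top bit and shift the rest left by one
  until c < 128        -- stop once the top bit is 0
  return ctr
end

-- Lead bytes of "A", "ľ", "€" and a 4-byte sequence.
for _, b in ipairs({0x41, 0xC4, 0xE2, 0xF0}) do
  print(string.format("lead byte 0x%02X -> %d continuation byte(s)",
                      b, continuation_count(b)))
end
```

Note that a stray continuation byte such as 0xBE also yields 0, which is why this approach depends on the input being well-formed.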
Egor Skriptunoff
  • This is an exact answer to the question but not a good answer without an explanation. – Tom Blodget Apr 24 '15 at 21:16
  • @TomBlodget - Your judgement is incorrect: as you see, no one has asked for any clarification of my answer. It looks like you are treating people as stupid creatures, so everything must be explained in details. On the contrary, I think people are smart enough. Of course, I am ready to give extra explanations if someone will tell me which part of my answer is unclear for him. – Egor Skriptunoff Apr 25 '15 at 11:33
  • @TomBlodget - "Your audience is smarter than you imagine." (a quote from [13 Writing Tips From Chuck Palahniuk](http://litreactor.com/essays/chuck-palahniuk/stocking-stuffers-13-writing-tips-from-chuck-palahniuk), tip #2) – Egor Skriptunoff Apr 25 '15 at 13:37
  • Thanks, I of course understand the idea, but it's not working in this case. When I use this function once, it still returns SYN as in the picture [link](http://i.imgur.com/PcorbzP.png), and when I compare this first UTF-8 char with ľ (the first character in the original text) it returns false. But thank you, it seemed to be a very elegant solution; I have no idea why it doesn't work. – Hrablicky Apr 26 '15 at 12:33
  • @Hrablicky - Check result with `print(read_utf8_char(file):byte(1,-1))` – Egor Skriptunoff Apr 26 '15 at 13:52
  • @EgorSkriptunoff nice, this way I could use it pretty well, thank you again – Hrablicky Apr 27 '15 at 10:21
  • This answer works but requires the input file to contain well-formed UTF-8. If the input file has invalid UTF-8 then it may return bogus results. – hugomg Aug 12 '16 at 00:25
  • @EgorSkriptunoff: Stack Overflow users are not stupid, but it's hard for readers to gauge the quality of a code-only answer. In this particular case, it's hard to know how the answer works without knowing the intricacies of the UTF-8 format. You should have said that the number of continuation bytes can be determined by reading the first byte. – hugomg Aug 12 '16 at 00:31
0

You need to keep reading so that the string you are matching always holds at least four bytes, the length of the longest UTF-8 sequence; that lets you apply the pattern from the answer you referenced. If the string's length is len after you match and remove a UTF-8 character, read 4-len more bytes from the file.
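The buffering described above might look like the following sketch. The iterator name `utf8_chars` is hypothetical, the pattern is adapted from the one in the question (anchored, and without %z so it does not depend on the Lua version), and the input is assumed to be well-formed UTF-8.

```lua
-- Hypothetical iterator: yields one UTF-8 character at a time from `file`
-- while holding at most four bytes in memory.
function utf8_chars(file)
  local buf = ""
  return function()
    -- Top the buffer up to four bytes (the longest UTF-8 sequence).
    if #buf < 4 then
      buf = buf .. (file:read(4 - #buf) or "")
    end
    if buf == "" then return nil end  -- end of file
    -- Lead byte followed by any continuation bytes, at the buffer start.
    local ch = buf:match("^[\1-\127\194-\244][\128-\191]*")
    -- Fall back to a single byte for NUL or malformed input.
    ch = ch or buf:sub(1, 1)
    buf = buf:sub(#ch + 1)
    return ch
  end
end

-- Usage: "ľúbo" written as explicit UTF-8 bytes to a temporary file.
local f = assert(io.tmpfile())
f:write("\196\190\195\186bo")
f:seek("set", 0)
for ch in utf8_chars(f) do
  print(#ch, ch:byte(1, -1))  -- byte length, then the byte values
end
f:close()
```

Because the buffer never exceeds four bytes, no rewinding of the file is needed; the unread tail simply stays in the buffer for the next call.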

ZeroBrane Studio replaces invalid UTF-8 characters with a [SYN] character when they are printed to the Output panel (as you see in the screenshot). This blogpost describes the logic behind the detection of invalid UTF-8 characters (in Lua) and their handling in ZeroBrane Studio.

Paul Kulchenko
0

In the UTF-8 encoding, the number of bytes taken by a character is determined by the first byte of that character, according to the following table (taken from RFC 3629):

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

If the highest bit of the first byte is "0", then the character has only one byte. If the highest bits are "110" then the character has 2 bytes and so on.
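As a worked illustration of the table, here is a sketch of an encoder that fills the x bits row by row. The name `utf8_encode` is hypothetical, plain arithmetic is used instead of bitwise operators so the sketch does not depend on a particular Lua version, and the function assumes a valid code point (it does not reject surrogates or values above 0x10FFFF).

```lua
-- Hypothetical helper: encode a code point following the RFC 3629 table.
function utf8_encode(cp)
  local floor = math.floor
  if cp < 0x80 then
    return string.char(cp)                             -- 0xxxxxxx
  elseif cp < 0x800 then
    return string.char(0xC0 + floor(cp / 0x40),        -- 110xxxxx
                       0x80 + cp % 0x40)               -- 10xxxxxx
  elseif cp < 0x10000 then
    return string.char(0xE0 + floor(cp / 0x1000),      -- 1110xxxx
                       0x80 + floor(cp / 0x40) % 0x40, -- 10xxxxxx
                       0x80 + cp % 0x40)               -- 10xxxxxx
  else
    return string.char(0xF0 + floor(cp / 0x40000),     -- 11110xxx
                       0x80 + floor(cp / 0x1000) % 0x40,
                       0x80 + floor(cp / 0x40) % 0x40,
                       0x80 + cp % 0x40)
  end
end

-- "ľ" is U+013E; it falls in the second row, so it takes two bytes.
print(utf8_encode(0x13E):byte(1, -1))
```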

What you can then do is read one byte from the file and determine how many continuation bytes you need to read to get the full UTF-8 character:

function get_one_utf8_character(file)

  local c1 = file:read(1)
  if not c1 then return nil end

  local ncont
  -- Bytes 192-193 and 245-255 never appear in valid UTF-8 (RFC 3629)
  if     c1:match("[\000-\127]") then ncont = 0
  elseif c1:match("[\194-\223]") then ncont = 1
  elseif c1:match("[\224-\239]") then ncont = 2
  elseif c1:match("[\240-\244]") then ncont = 3
  else
    return nil, "invalid leading byte"
  end

  local bytes = { c1 }
  for i=1,ncont do
    local ci = file:read(1)
    if not (ci and ci:match("[\128-\191]")) then
      return nil, "expected continuation byte"
    end
    bytes[#bytes+1] = ci
  end

  return table.concat(bytes)
end
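Once a whole character has been read, comparing or classifying it can be easier on its code point (the UTF-32 value suggested in a comment on the question). The sketch below uses a hypothetical helper, `utf8_codepoint`, and assumes its argument is one complete, well-formed sequence; it performs no validation of its own.

```lua
-- Hypothetical helper: turn the bytes of one UTF-8 character into its code point.
function utf8_codepoint(ch)
  local b1 = ch:byte(1)
  local cp
  -- Strip the length marker from the lead byte, keeping only its x bits.
  if     b1 < 0x80 then cp = b1          -- 0xxxxxxx
  elseif b1 < 0xE0 then cp = b1 - 0xC0   -- 110xxxxx
  elseif b1 < 0xF0 then cp = b1 - 0xE0   -- 1110xxxx
  else                  cp = b1 - 0xF0   -- 11110xxx
  end
  -- Each continuation byte contributes its low six bits.
  for i = 2, #ch do
    cp = cp * 0x40 + (ch:byte(i) - 0x80)
  end
  return cp
end

-- The two bytes of "ľ" decode back to U+013E.
print(string.format("U+%04X", utf8_codepoint("\196\190")))
```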
hugomg