Inserting the character U+0080 into any table in MonetDB v11.39.5 inserts Â in its place. The character may not display well here. It was a failed attempt at writing "€". It shows as \200 in Emacs and as a square with numbers in it in Eclipse and Firefox (0 0 in the top row and 8 0 in the bottom row). Inserting this character into a SQLite database set to UTF-8 encoding stores exactly this character. To reproduce the problem:

create schema if not exists test;

create table if not exists test.test(c text);

INSERT INTO test.test(c) VALUES('');

select * from test.test;

I'm on Debian GNU/Linux 10 (buster). This character is valid UTF-8 according to https://onlineutf8tools.com/validate-utf8 and to the methods described in "How to check whether a file is valid UTF-8?".

The output of xxd -a FILENAME with the character in FILENAME is 00000000: c280 .. when the file is saved with Emacs and 00000000: c280 0a ... when the file is saved with gedit.
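The validity claim can also be checked locally. The following is a minimal sketch, assuming Python 3 is available; the helper name `is_valid_utf8` is made up for illustration, and the bytes are the ones reported by xxd above.

```python
# Bytes reported by xxd for the file saved with Emacs (taken from the question).
data = bytes.fromhex("c280")


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(data))      # True: C2 80 is a complete two-byte sequence
print(is_valid_utf8(b"\x80"))   # False: a lone 0x80 is only a continuation byte
print([hex(ord(ch)) for ch in data.decode("utf-8")])  # ['0x80']
```

The two-byte sequence decodes to a single code point, while the byte 0x80 on its own is rejected.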

echo $LANG prints fr_FR.UTF-8 on my computer.

Thank you

  • The byte \200 by itself is not valid UTF-8. It is a so-called continuation byte and only occurs as part of a longer sequence. Could you show a complete example, containing a CREATE TABLE statement and an INSERT statement? For completeness, could you also put the example in a file and show the output of `xxd -a FILENAME`? We can use this to get a byte-exact replica of your example file. Finally, could you show us the output of `echo $LANG` so we know your locale? Best regards, Joeri – Joeri van Ruth Nov 26 '20 at 14:50
  • That looks like a double encoding problem, where a string of bytes that is already in UTF-8 is encoded again. The characters up to code point 127 (ASCII) stay unchanged, but the rest become mangled. – Dragonthoughts Nov 26 '20 at 16:14
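The double-encoding idea from the last comment can be sketched as follows. This is only an illustration, assuming Python 3.8 or later and assuming the faulty step reads the UTF-8 bytes as Latin-1; nothing in this thread confirms where, or whether, that happens in the setup described.

```python
# UTF-8 bytes of U+0080, as found in the example file.
original = "\u0080".encode("utf-8")        # b'\xc2\x80'

# Assumed faulty step: the bytes are read as Latin-1, one character per byte,
# and the result is then encoded as UTF-8 a second time.
misread = original.decode("latin-1")       # 'Â' followed by U+0080
double_encoded = misread.encode("utf-8")

print(double_encoded.hex(" "))             # c3 82 c2 80
print(misread[0])                          # Â

# ASCII text survives the same round trip unchanged.
ascii_bytes = "abc".encode("utf-8")
assert ascii_bytes.decode("latin-1").encode("utf-8") == ascii_bytes
```

Under that assumption the first byte, C2, turns into the character Â, which matches the symptom reported in the question.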

1 Answer

Your file is indeed correctly encoded. The bytes C2 80 are the UTF-8 encoding of the Unicode code point U+0080, which is not the euro sign but a C1 control character. The euro sign is U+20AC, which UTF-8 encodes as E2 82 AC.
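A short sketch, assuming Python 3.8 or later, makes the difference between the two characters concrete. The Windows-1252 line is an aside that is not taken from the question: in that legacy encoding the single byte 0x80 is the euro sign, which is one possible way a stray 0x80 ends up standing in for "€".

```python
import unicodedata

# Decode the two bytes from the example file.
c1 = bytes.fromhex("c280").decode("utf-8")
print(f"U+{ord(c1):04X}", unicodedata.category(c1))   # U+0080 Cc (control)

# The real euro sign, for comparison.
euro = "\u20ac"
print(unicodedata.name(euro))           # EURO SIGN
print(euro.encode("utf-8").hex(" "))    # e2 82 ac
print(euro.encode("cp1252").hex())      # 80
```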

If I try your example on my system, which is also Debian 10, with a home-compiled Oct2020-SP1, I get

$ mclient -d foo t.sql
+------+
| %2   |
+======+
|      |
: ...  >
+------+
1 tuple !1 field truncated!
note: to disable dropping columns and/or truncating fields use \w-1

$ mclient -d foo t.sql -fraw
% .%2 # table_name
% %2 # name
% char # type
% 1 # length
[ "\302\200"    ]

Octal 302 is hex C2 and octal 200 is hex 80, so it seems the U+0080 is not being corrupted; it comes back unchanged. Why it is changed into a 'Â' on your system, I have no idea. The UTF-8 encoding of Â is C3 82.
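The octal-to-hex reading of the raw output can be double-checked with a few lines of Python (a sketch, assuming Python 3.8 or later; the octal escapes are copied from the `-fraw` output above).

```python
# The field as printed by mclient -fraw, written as a bytes literal
# using the same octal escapes.
raw = b"\302\200"

print(raw.hex(" "))                        # c2 80
print(oct(0xC2), oct(0x80))                # 0o302 0o200
print(raw.decode("utf-8") == "\u0080")     # True

# For comparison, the UTF-8 bytes of the character Â (U+00C2).
print("\u00c2".encode("utf-8").hex(" "))   # c3 82
```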