0

In a MySQL database of ancient Greek words, I want to look for all words which have a character with a dot underneath the letter (Unicode Character 'COMBINING DOT BELOW' (U+0323)) and remove this.

I tried various encodings, but none seems to work ...

MarkD
  • 37
  • 5

1 Answers1

0

I'll assume that you have a column col in table t, and col is CHARACTER SET utf8 (or utf8mb4).

SELECT ...
    FROM t
    WHERE HEX(col) REGEXP '^(..)*CCA3'

Will locate all lines with U+0323 (encoded as UTF-8 CCA3).

Alas, there is no easy way to modify col in SQL.

If you have MariaDB, then UNHEX(REGEXP_REPLACE(HEX(col), 'CCA3', '')) will almost work. The problem is that it could be misaligned (at an odd boundary).

Another thought: REPLACE(col, UNHEX('CCA3'), '') might work in MySQL or MariaDB. However, it could get into issues with Collations. So, perhaps this would work: REPLACE(col COLLATE utf8_bin, UNHEX('CCA3'), '')

Rick James
  • 135,179
  • 13
  • 127
  • 222