1

In a MySQL database of ancient Greek words (collated in utf8mb4_unicode_ci) I am trying to detect which words are starting with a capital and in that case add the value 1 to a field include. I have tried several options using binary, e.g.

UPDATE word
SET include = 1
WHERE LEFT(`lemma`, 1) REGEXP BINARY '[Α-Ω]'

But also other solutions with UPPER etc. None works. Any ideas?

Rick James
  • 135,179
  • 13
  • 127
  • 222
MarkD
  • 37
  • 5
  • 2
    check this: http://stackoverflow.com/questions/5629111/how-can-i-make-sql-case-sensitive-string-comparison-on-mysql – Jenish Apr 27 '17 at 08:24
  • OK great, thanks: UPDATE word SET include = 1 WHERE BINARY LEFT(lemma, 1) = BINARY UPPER(LEFT(lemma,1)) solved the problem – MarkD Apr 27 '17 at 08:41
  • Welcome to Stack Overflow. You can format source code with the [Code Sample `{}`](https://i.imgur.com/3ppJhiE.png) toolbar button—I've done it for you this time. You also have a preview pane right below the editor so you can see how changes look like before you publish them. – Álvaro González Apr 27 '17 at 08:48

1 Answers1

0

REGEXP only works with bytes, so it is hit or miss on non-English letters.

MariaDB 10.0.5 has "pcre", so it does handle utf8 regexps.

Α-Ω is hexCE91-CEA9.

WHERE HEX(lemma) REGEXP '^CE'

Would check for starting with a Greek letter. (It would catch all of ΄΅Ά·ΈΉΊΌΎΏΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩΪΫάέήίΰαβγδεζηθικλμνξο.)

To restrict to just the capital letters, Α-Ω (as you requested):

WHERE HEX(LEFT(lemma, 1)) BETWEEN 'CE91' AND 'CEA9'

While I am at it, this checks for Greek anywhere in lemma:

WHERE HEX(lemma) REGEXP '^(..)*CE'

(Again assuming that longer list of letters, not just Α-Ω.)

(My comments apply to CHARACTER SET utf8 or utf8mb4; any COLLATION.)

Rick James
  • 135,179
  • 13
  • 127
  • 222