How do I write regexes for German character classes like letters, vowels, and consonants?

Question

For example, I set up these:

L = /[a-z,A-Z,ßäüöÄÖÜ]/
V = /[äöüÄÖÜaeiouAEIOU]/
K = /[ßb-zBZ&&[^#{V}]]/

So that /(#{K}#{V}{2})/ matches "ßäÜ" in "azAZßäÜ".

Are there any better ways of dealing with them?

Could I put those constants in a module in a file somewhere in my Ruby installation folder, so I can include/require them inside any new script I write on my computer? (I'm a newbie and I know I'm muddling this terminology; please correct me.)

Furthermore, could I get just the meta-characters \L, \V, and \K (or whatever isn't already set in Ruby) to stand for them in regexes, so I don't have to do that string interpolation thing all the time?

your approach seems pretty sound. you can shorten K like this: `/[ßb-zB-Z&&[^aeiouAEIOU]]/` if you like. — Martin Ender, Apr 19 '13 at 09:49
Your "module in installation folder" is a gem. See http://guides.rubygems.org/ for more details. — knut, Apr 19 '13 at 12:24
Oh, thanks, yes, I ended up just putting the constants in another file in the same folder and putting `require '/.constants.rb'` in any script in that folder I need to use them in. Works for now. — Owen_AR, Apr 19 '13 at 16:19
Be sure to look at the POSIX and Unicode script extensions to the standard [Regexp](http://www.ruby-doc.org/core-2.0.0/Regexp.html#class-Regexp-label-Character+Properties) character classes. They're already tested and battle-hardened. — the Tin Man, Nov 18 '13 at 14:00
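
As a rough sketch of what the comments above describe: the constants could live in a module in their own file, which other scripts then load. The file name `german_chars.rb` and the module name `GermanChars` are made up for illustration, and K uses the shortened form from the first comment.

```ruby
# german_chars.rb (hypothetical file name)
module GermanChars
  L = /[a-zA-ZßäöüÄÖÜ]/              # letters, without the stray commas
  V = /[aeiouAEIOUäöüÄÖÜ]/           # vowels, including umlauts
  K = /[ßb-zB-Z&&[^aeiouAEIOU]]/     # consonants: the left set minus the plain vowels
end

# In another script in the same folder you would load the file first:
#   require_relative 'german_chars'

# Interpolating a Regexp inserts it as a group, so {2} applies to the whole vowel class.
kvv = /(#{GermanChars::K}#{GermanChars::V}{2})/
puts "azAZßäÜ"[kvv, 1]   # prints "ßäÜ"
```

Referring to the constants as `GermanChars::K` keeps the short names out of the global namespace; `include GermanChars` would make the bare `K` and `V` available instead.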

score 1 · Answer 1 · answered Nov 18 '13 at 15:10

You're starting pretty well, but you need to look through the Regexp class code that is installed by Ruby. There are tricks for writing patterns that build themselves using String interpolation. You write the bricks and let Ruby build the walls and house with normal String tricks, then turn the resulting strings into true Regexp instances for use in your code.

For instance:

LOWER_CASE_CHARS = 'a-z'
UPPER_CASE_CHARS = 'A-Z'
CHARS = LOWER_CASE_CHARS + UPPER_CASE_CHARS
DIGITS = '0-9'

CHARS_REGEX = /[#{ CHARS }]/
DIGITS_REGEX = /[#{ DIGITS }]/

WORDS = "#{ CHARS }#{ DIGITS }_"
WORDS_REGEX = /[#{ WORDS }]/

You keep building from small atomic characters and character classes and soon you'll have big regular expressions. Try pasting those one by one into IRB and you'll quickly get the hang of it.
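
Applied to the German classes from the question, the same brick-building idea might look like the sketch below. The `DE_*` constant names are invented for this example, and treating every non-vowel letter (including `y`) as a consonant is an assumption.

```ruby
# Bricks: plain strings holding the contents of character classes.
DE_LOWER_VOWELS = 'aeiouäöü'
DE_UPPER_VOWELS = 'AEIOUÄÖÜ'
DE_VOWELS       = DE_LOWER_VOWELS + DE_UPPER_VOWELS
DE_LETTERS      = 'a-zA-ZäöüÄÖÜß'

# Walls: regexes built from the bricks by interpolation.
DE_LETTER_REGEX    = /[#{DE_LETTERS}]/
DE_VOWEL_REGEX     = /[#{DE_VOWELS}]/
# && intersects two sets: any letter that is not a vowel.
DE_CONSONANT_REGEX = /[#{DE_LETTERS}&&[^#{DE_VOWELS}]]/

# House: a consonant followed by two vowels.
DE_KVV_REGEX = /#{DE_CONSONANT_REGEX}#{DE_VOWEL_REGEX}{2}/

puts "azAZßäÜ"[DE_KVV_REGEX]                 # prints "ßäÜ"
p "Straße".scan(DE_CONSONANT_REGEX)          # ["S", "t", "r", "ß"]
```

Because the bricks are strings rather than Regexp objects, they can be dropped straight into a bracket expression, which is what makes the negated set in `DE_CONSONANT_REGEX` work.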

score 0 · Answer 2 · answered Mar 29 '14 at 13:19

A small improvement on what you do now would be to use the regex engine's Unicode support for categories and scripts.

If you mean L to be any letter, use \p{L}. Or use \p{Latin} if you want it to mean any letter in a Latin script (all German letters are).

I don't think there are built-ins for vowels and consonants.

For instance, \p{L} matches every character of your example string "azAZßäÜ".
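
A minimal sketch of how those properties might be combined, assuming a UTF-8 source file. The constant names are illustrative only, and the vowel list still has to be spelled out by hand.

```ruby
ANY_LETTER   = /\p{L}/        # any letter in any script
LATIN_LETTER = /\p{Latin}/    # any character in the Latin script

# No built-in exists for vowels, so list them explicitly.
VOWEL = /[aeiouäöüAEIOUÄÖÜ]/
# Consonant: a Latin-script character that is not one of the listed vowels.
LATIN_CONSONANT = /[\p{Latin}&&[^aeiouäöüAEIOUÄÖÜ]]/

puts "azAZßäÜ".scan(ANY_LETTER).join       # prints "azAZßäÜ"
p "Straße 7".scan(LATIN_CONSONANT)         # ["S", "t", "r", "ß"]
p "Привет".match?(LATIN_LETTER)            # false: Cyrillic, not Latin
```

One caveat with this approach: accented vowels from other languages, such as `é`, belong to the Latin script but are not in the vowel list, so `LATIN_CONSONANT` would treat them as consonants.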
