I'm trying to create a regular expression that matches words under the following conditions:

  • Match words which can contain characters like: æøå, and numbers.
  • If a word contains any of the following characters, it is invalid:
    + - & | ! ( ) { } [ ] ^ " ~ * ? : \

So for example these words are okay:

testæøå
test12
12test

But these should fail:

t+st
te&st

Martin Ender
Johnny
  • If utf-8 / character properties are available: `/^(\p{L}|[0-9])*$/` is probably what you are looking for. – Wrikken Dec 14 '12 at 21:52
  • which language/tool are you using? – Martin Ender Dec 14 '12 at 21:53
  • You don't need a regex for this (if you are using a programming language with a "string" type). – Lev Levitsky Dec 14 '12 at 21:54
  • @LevLevitsky: You don't *need* a regular expression for *anything*, but this seems to be one of the tasks that regular expressions are well-suited to. – ruakh Dec 14 '12 at 21:56
  • @Lev Well, my problem is that I have input from an email subject field and need to pass it to a web service that can't handle these characters, so I must first 'sanitize' the subject field and then pass it in. – Johnny Dec 14 '12 at 21:56
  • @ruakh What I mean is that the condition to be checked is not really a pattern; rather, it is a list of forbidden characters. Checking whether a string contains any of the characters from a set is not something I'd do with regex, because it doesn't win much here. – Lev Levitsky Dec 14 '12 at 22:03
  • Yeah, I see. Maybe I should just go the string manipulation route, since I'm no expert in regex. – Johnny Dec 14 '12 at 22:06
  • +1 for "keeping an open mind" as advised in [ask]. – Lev Levitsky Dec 14 '12 at 22:39
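
The whitelist idea from Wrikken's comment (`\p{L}` or `[0-9]` for every character) can also be expressed without a regex. The following is only a rough sketch in Python rather than the asker's C#; the function name is hypothetical, and it assumes `str.isalpha()` is an acceptable stand-in for `\p{L}`, since Python's built-in `re` module has no `\p{...}` support.

```python
ASCII_DIGITS = "0123456789"


def is_letters_and_digits(word: str) -> bool:
    """Whitelist check: every character is a Unicode letter or an ASCII digit.

    Hypothetical helper, not taken from the thread.
    """
    # str.isalpha() is true for characters in the Unicode "Letter" categories,
    # which covers æ, ø and å as well as a-z.
    return all(ch.isalpha() or ch in ASCII_DIGITS for ch in word)


if __name__ == "__main__":
    for candidate in ["testæøå", "test12", "12test", "t+st", "te&st"]:
        print(candidate, is_letters_and_digits(candidate))
```

Like the `*` quantifier in the commented pattern, this accepts the empty string. A whitelist is stricter than the question's blacklist: it also rejects spaces and any punctuation that is not on the forbidden list.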

2 Answers

How about something like `[^+\-&|!(){}\[\]\^"~*?:\\]+`? This matches any run of characters that contains none of the characters you want to exclude. You'll have to check that I've escaped the metacharacters correctly within the enclosing `[` and `]`.

DWright
  • Almost... but it seems to match the word "add-in": it highlights "add" and "in", but the word contains a hyphen, so "add-in" should not match. – Johnny Dec 14 '12 at 21:59
  • Could you post the line(s) of code where you use the regex pattern? – DWright Dec 14 '12 at 22:02
  • Also, to get it to recognize that `-` is part of the character class, perhaps this might work: `[^+&|!(){}\[\]\^"~*?:\\-]+`. Note that I put the hyphen last in the character class. Should have thought of that before. – DWright Dec 14 '12 at 22:02
  • Take a look at these answers: [1](http://stackoverflow.com/a/3529805/49251) and [2](http://stackoverflow.com/a/3521450/49251), which might help. – DWright Dec 14 '12 at 22:06
  • Thanks for your help, but I'm gonna go with Thumper; apparently there is no speed gain in using regex in C# =/ – Johnny Dec 14 '12 at 22:09
  • That's fine. Although you were asking for regex, which is why I stuck to regex. – DWright Dec 14 '12 at 22:10
  • Yes, I did. I initially thought this would be the way to go. Upvoted you for your trouble. – Johnny Dec 14 '12 at 22:11
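
The "add-in" behaviour reported in the comments is what an unanchored negated character class does: a search finds the runs on either side of the hyphen instead of rejecting the whole word. A minimal sketch, in Python rather than C# and with hypothetical helper names, contrasting a plain search with a whole-string match (the pattern here includes `&` from the question's list):

```python
import re

# Negated character class built from the question's forbidden list.
NOT_FORBIDDEN = r'[^+\-&|!(){}\[\]\^"~*?:\\]+'


def find_runs(text: str) -> list[str]:
    """Unanchored search: returns every run of allowed characters."""
    return re.findall(NOT_FORBIDDEN, text)


def is_valid_word(word: str) -> bool:
    """Anchored check: the whole word must consist of allowed characters."""
    return re.fullmatch(NOT_FORBIDDEN, word) is not None


if __name__ == "__main__":
    print(find_runs("add-in"))       # the two runs around the hyphen
    print(is_valid_word("add-in"))   # the word as a whole is rejected
    print(is_valid_word("testæøå"))
```

In engines without a whole-string match function, wrapping the pattern in `^` and `$` should have the same effect. Note that this blacklist still accepts spaces and any punctuation not on the list.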

Just in case you don't know, regex in C# is generally slower than plain string manipulation for a simple check like this: Regex in C#
You can speed it up by constructing the Regex with RegexOptions.Compiled, although that makes your program slower to start. If this is going to be anything web-based (C#/Silverlight), I strongly recommend string manipulation and searching over regex, since otherwise it will be very slow for anyone using the page.

You can easily match Unicode or ASCII codes of characters and accept/deny words from there, with much better performance.

If you are determined to use regex, consider Perl or another scripting language that is much faster at regex-based string manipulation.

Thumper
  • Well you convinced me to abandon regex. I'll do this by good ol' string manipulation... – Johnny Dec 14 '12 at 22:08
  • Best of luck! Adding those extra characters might make things a bit messy unless you create your own Verify class, or something :-P – Thumper Dec 14 '12 at 22:16