21

I have a regular expression to validate a string. But now I want to remove all the characters that do not match my regular expression.

E.g.

regExpression = @"^([\w\'\-\+])"

text = "This is a sample text with some invalid characters -+%&()=?";

//Remove characters that do not match regExp.

result = "This is a sample text with some invalid characters -+";

Any ideas on how I can use the regular expression to determine the valid characters and remove all the others?

Many thanks

tif

3 Answers

23

I believe you can do this (whitelist characters and replace everything else) in one line:

var result = Regex.Replace(text, @"[^\w\s\-\+]", "");

Technically it will produce this: "This is a sample text with some invalid characters - +" which is slightly different than your example (the extra space between the - and +).
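As a minimal, self-contained sketch of the same whitelist idea (the class and method names `WhitelistDemo` and `KeepAllowed` are hypothetical, chosen only for illustration), the one-liner can be wrapped like this:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitelistDemo
{
    // Removes every character that is NOT a word character, whitespace, '-' or '+'.
    // The leading ^ inside the brackets negates the character class.
    public static string KeepAllowed(string text)
    {
        return Regex.Replace(text, @"[^\w\s\-\+]", "");
    }

    public static void Main()
    {
        var text = "This is a sample text with some invalid characters -+%&()=?";
        Console.WriteLine(KeepAllowed(text));
    }
}
```

Note that `\s` is part of the whitelist here, so whitespace is kept; drop it from the class if spaces should be removed as well.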

emfurry
  • This will not work if the regex to match the text is more complicated. You can't negate every regex that easily. – Daniel Hilgarth May 27 '11 at 15:43
  • True, but the poster has said he/she needs removal on a character-level basis, for which this should suffice. Further, if you need greater precision, consider: `var result = Regex.Replace(text, @"[^\w]", m => "%&=?()".Contains(m.Value) ? "" : m.Value);` You can replace my MatchEvaluator with any code to determine whether or not to keep a character. – emfurry May 27 '11 at 15:53
16

Simple as that:

var match = Regex.Match(text, regExpression);
string result = "";
if(match.Success)
    result = match.Value;

Removing the non-matched characters is the same as keeping the matched ones. That's what we are doing here.

If it is possible that the expression matches multiple times in your text, you can use this:

var result = Regex.Matches(text, regExpression).Cast<Match>()
                  .Aggregate("", (s, e) => s + e.Value, s => s);
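
For the multiple-match variant to keep every valid character, the pattern must be able to match anywhere in the input, not only at the start. A minimal sketch, assuming an unanchored character class `[\w'\-\+]+` (an adaptation for illustration, not the pattern from the question) and hypothetical names `KeepMatchesDemo` and `KeepMatches`:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeepMatchesDemo
{
    // Concatenates every match of the pattern; everything between matches is dropped.
    public static string KeepMatches(string text, string pattern)
    {
        // Cast<Match>() is required where MatchCollection only implements
        // the non-generic IEnumerable (e.g. .NET Framework).
        return string.Concat(Regex.Matches(text, pattern).Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // The space is not in the character class, so it is removed too.
        Console.WriteLine(KeepMatches("abc -+%&()=?", @"[\w'\-\+]+")); // prints "abc-+"
    }
}
```

`string.Concat` does the same job as the `Aggregate` call without building an intermediate string per match.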
Daniel Hilgarth
  • Hi Daniel, I tried your solution, but as you mentioned, my regular expression will match more than once, because I need it to remove just the invalid characters and keep all the valid ones. I could not use the second piece of code; I get an error at `Cast<Match>()`. Am I supposed to replace that part with something else, or should I use your code as you typed it? Thanks – tif May 30 '11 at 08:14
  • (1) The regex you provided is not doing what you expect it to do. (2) What is the error you get? I actually tested that code and it works. – Daniel Hilgarth May 30 '11 at 08:17
  • (1) Why is the regex wrong, and how should it be? I use the same regex in a similar method that just validates whether a string is valid, but this new method, instead of returning true on a match, removes/replaces the invalid characters. I guess I need two different regexes, as one will not work in both cases, right? (2) I forgot to add the using directive for System.Linq – tif May 30 '11 at 08:56
  • The regex matches **one** word *or* **one** of the following characters: ' - + *at the beginning of the line* – Daniel Hilgarth May 30 '11 at 08:57
  • What are the advantages and disadvantages of your approach compared with @emfurry's approach? Is there anything I should take into consideration? – tif May 30 '11 at 08:58
  • I already wrote about the problems with emfurry's approach in a comment to his answer – Daniel Hilgarth May 30 '11 at 08:59
  • This one line is beautiful – Basic Coder Jan 18 '17 at 06:55
  • What does `.Cast<Match>()` buy you? Don't you already have a `MatchCollection`? – ruffin Mar 22 '19 at 00:35
  • @ruffin: `MatchCollection` only implements `IEnumerable` but not `IEnumerable<Match>`, so you can't use it directly in a LINQ expression. – Daniel Hilgarth Mar 25 '19 at 07:46
  • When I F12 in, it looks like `MatchCollection` may have been updated to implement the latter, at least in .NET Core 2.0. So perhaps no longer necessary, depending on your target platform. `public class MatchCollection : ICollection, IEnumerable, ICollection, IEnumerable, IList...` – ruffin Mar 25 '19 at 12:23
3

Thanks to the answer to "Replace chars if not match", I've created a helper method that strips unaccepted characters.

The allowed pattern should be in regex format and is expected to be wrapped in square brackets. The function inserts a caret (^) after the opening square bracket. I expect that it will not work for every regex describing a set of valid characters, but it works for the relatively simple sets that we are using.

/// <summary>
/// Replaces characters that are not in the allowed set.
/// </summary>
/// <param name="text">The text.</param>
/// <param name="allowedPattern">The allowed pattern in regex format, expected to be wrapped in square brackets.</param>
/// <param name="replacement">The replacement.</param>
/// <returns>The text with every character outside the allowed set replaced.</returns>
// See https://stackoverflow.com/questions/4460290/replace-chars-if-not-match
// and https://stackoverflow.com/questions/6154426/replace-remove-characters-that-do-not-match-the-regular-expression-net
public static string ReplaceNotExpectedCharacters(this string text, string allowedPattern, string replacement)
{
    allowedPattern = allowedPattern.StripBrackets("[", "]");
    // [^ ] at the start of a character class negates it: it matches characters not in the class.
    var result = Regex.Replace(text, @"[^" + allowedPattern + "]", replacement);
    return result;
}

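public static string RemoveNonAlphanumericCharacters(this string text)
{
    var result = text.ReplaceNotExpectedCharacters(NonAlphaNumericCharacters, "");
    return result;
}

// Despite its name, this constant holds the ALLOWED (alphanumeric) set;
// ReplaceNotExpectedCharacters negates it to find the characters to remove.
public const string NonAlphaNumericCharacters = "[a-zA-Z0-9]";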

A couple of functions from my StringHelper class (http://geekswithblogs.net/mnf/archive/2006/07/13/84942.aspx) are used here.

/// <summary>
/// StripBrackets checks whether the string starts with sStart and ends with sEnd (case sensitive).
/// If yes, then it removes sStart and sEnd.
/// Otherwise it returns the full string unchanged.
/// See also MidBetween.
/// </summary>
public static string StripBrackets(this string str, string sStart, string sEnd)
{
    if (CheckBrackets(str, sStart, sEnd))
    {
        str = str.Substring(sStart.Length, (str.Length - sStart.Length) - sEnd.Length);
    }
    return str;
}

public static bool CheckBrackets(string str, string sStart, string sEnd)
{
    bool flag1 = (str != null) && (str.StartsWith(sStart) && str.EndsWith(sEnd));
    return flag1;
}
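
To show how the helpers fit together, here is a condensed, self-contained sketch. The class names `StringHelperSketch` and `Program` are hypothetical, and the extra length check in `StripBrackets` is my own assumption (it guards against strings shorter than the two delimiters combined); it is not part of the original helper.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringHelperSketch
{
    // Strips sStart/sEnd when the string is wrapped in them; otherwise returns it unchanged.
    public static string StripBrackets(this string str, string sStart, string sEnd)
    {
        bool wrapped = str != null
            && str.Length >= sStart.Length + sEnd.Length
            && str.StartsWith(sStart, StringComparison.Ordinal)
            && str.EndsWith(sEnd, StringComparison.Ordinal);
        return wrapped
            ? str.Substring(sStart.Length, str.Length - sStart.Length - sEnd.Length)
            : str;
    }

    // Builds the negated class [^...] from the allowed set and replaces every match.
    public static string ReplaceNotExpectedCharacters(this string text, string allowedPattern, string replacement)
    {
        allowedPattern = allowedPattern.StripBrackets("[", "]");
        return Regex.Replace(text, "[^" + allowedPattern + "]", replacement);
    }
}

public static class Program
{
    public static void Main()
    {
        // Brackets around the allowed set are optional.
        Console.WriteLine("Hello, World! 123".ReplaceNotExpectedCharacters("[a-zA-Z0-9]", "")); // HelloWorld123
        Console.WriteLine("a_b-c".ReplaceNotExpectedCharacters("a-z", "?"));                    // a?b?c
    }
}
```

Because the allowed set is spliced into a character class as-is, a pattern containing an unescaped `]` or a leading `^` would change the meaning of the class, which is one reason this only suits simple sets.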
Michael Freidgeim