4

This is the code:

StringBuilder sb = new StringBuilder();
Regex rgx = new Regex("[^a-zA-Z0-9 -]");

var words = Regex.Split(textBox1.Text, @"(?=(?<=[^\s])\s+\w)");
for (int i = 0; i < words.Length; i++)
{
    words[i] = rgx.Replace(words[i], "");
}

When im doing the Regex.Split() the words contain also strings with chars inside for exmaple:

Daniel>

or

Hello:

or

\r\nNew

or

hello---------------------------

And i need to get only the words without all the signs

So i tried to use this loop but i end that in words there are many places with "" And some places with only ------------------------

And i cant use this as strings later in my code.

joe
  • 8,344
  • 9
  • 54
  • 80
Haim Kashi
  • 399
  • 7
  • 19

3 Answers3

11

You don't need a regex to clear non-letters. This will remove all non-unicode letters.

public string RemoveNonUnicodeLetters(string input)
{
    StringBuilder sb = new StringBuilder();
    foreach(char c in input)
    {
        if(Char.IsLetter(c))
           sb.Append(c);
    }

    return sb.ToString();
}

Alternatively, if you only want to allow Latin letters, you can use this

public string RemoveNonLatinLetters(string input)
{
    StringBuilder sb = new StringBuilder();
    foreach(char c in input)
    {
        if(c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
           sb.Append(c);
    }

    return sb.ToString();
}

Benchmark vs Regex

public static string RemoveNonUnicodeLetters(string input)
{
       StringBuilder sb = new StringBuilder();
       foreach (char c in input)
       {
            if (Char.IsLetter(c))
                sb.Append(c);
       }

            return sb.ToString();
}



static readonly Regex nonUnicodeRx = new Regex("\\P{L}");

public static string RemoveNonUnicodeLetters2(string input)
{
     return nonUnicodeRx.Replace(input, "");
}


static void Main(string[] args)
{

    Stopwatch sw = new Stopwatch();

    StringBuilder sb = new StringBuilder();


    //generate guids as input
    for (int j = 0; j < 1000; j++)
    {
        sb.Append(Guid.NewGuid().ToString());
    }

    string input = sb.ToString();

    sw.Start();

    for (int i = 0; i < 1000; i++)
    {
        RemoveNonUnicodeLetters(input);
    }

    sw.Stop();
    Console.WriteLine("SM: " + sw.ElapsedMilliseconds);

    sw.Restart();
    for (int i = 0; i < 1000; i++)
    {
        RemoveNonUnicodeLetters2(input);
    }

    sw.Stop();
    Console.WriteLine("RX: " + sw.ElapsedMilliseconds);


}

Output (SM = String Manipulation, RX = Regex)

SM: 581
RX: 9882

SM: 545
RX: 9557

SM: 664
RX: 10196
keyboardP
  • 68,824
  • 13
  • 156
  • 205
  • You may not *need* regex but don’t you think that a regex one-liner is rather superior to your code? – Konrad Rudolph Jul 03 '13 at 20:43
  • Not necessarily. This method is reusable and can also be called in one line. Also easier to maintain IMHO. – keyboardP Jul 03 '13 at 20:44
  • 1
    … Easier to maintain than a regular expression that clearly and directly expresses the intent? That sounds ludicrous. This is the kind of things regex are *made for*. – Konrad Rudolph Jul 03 '13 at 20:45
  • 3
    I'd wager people would prefer an method name over a regex when it comes to descriptions. Regex, IMHO, are for situations when string parsing is too complicated or verbose. – keyboardP Jul 03 '13 at 20:46
  • 1
    So put the regex in a method. – Konrad Rudolph Jul 03 '13 at 20:47
  • 1
    Fair comment. Maybe I'm in the minority but I try and opt for string manipulation when the requirements are basic. This code isn't typed multiple times, it's typed once and reused. – keyboardP Jul 03 '13 at 20:49
  • 3
    I'm with you. Regular expressions are awesome, but clear code is even better. – Joe Enos Jul 03 '13 at 20:51
  • @Joe I am arguing **for** clean code here. For the record, I actually think the code in this answer is very decent. But regex are getting an unjustified bad rep. They are *awesome* for parsing text on character level – in fact, they are *the* best-suited tool for the job. – Konrad Rudolph Jul 03 '13 at 20:54
  • 1
    @KonradRudolph For me, you can't just solve today's problem. You also have to plan for tomorrow. Suppose tomorrow it changes to "Letters, spaces, exactly two words, where the first word is 3-6 characters, the second word is 4-10 characters, and the whole string is between 10-15 characters with exactly 4 vowels." (ridiculous example, but you get where I'm going). Sure, it may still be a one-liner with a RegEx, but now it's a much longer expression, one that you'll probably have to test thoroughly, and may not be easy to read. Without RegEx, it's a few easy lines of simple readable code. – Joe Enos Jul 03 '13 at 21:04
  • @KonradRudolph But I'm not against RegEx - I just think in general they are much harder to read unless you're already very familiar with them, and much easier to screw up. – Joe Enos Jul 03 '13 at 21:06
  • 1
    I agree that regex sometimes gets avoided unnecessarily but I would add that regex is also more expensive and would actually be slower in this case. Of course, that would be reversed given a different scenario but for basic cleaning such as this I believe string iteration should be extremely fast. – keyboardP Jul 03 '13 at 21:08
  • @keyboardP Slower in this case? If that’s the case (I haven’t checked …!) then .NET’s implementation sucks. Regex performance is another area of pervasive misunderstandings. There is no technical reason why it should be slower than your code. – Konrad Rudolph Jul 03 '13 at 21:13
  • @KonradRudolph - Please let me know if there any mistakes in the benchmarking but it seems that the regex is significantly slower. – keyboardP Jul 03 '13 at 21:25
  • @keyboardP Okay, that’s supremely lame. I don’t see an obvious error in the benchmark – and even making it more robust doesn’t change the result. Precompiling the regular expression also doesn’t help much. – Konrad Rudolph Jul 03 '13 at 21:49
2

keyboardP’s solution is decent – do consider it. But as I’ve argued in the comments, regular expressions are actually the correct tool for the job, you’re just making it unnecessarily complicated. The actual solution is a one-liner:

var result = Regex.Replace(input, "\\P{L}", "");

\P{…} specifies a Unicode character class we do not want to match (the opposite of \p{…}). L is the Unicode character class for letters.

Of course it makes sense to encapsulate this into a method, as keyboardP did. To avoid recompiling the regular expression over again, you should also consider pulling the regex creation out of the actual code (although this probably won’t give a big impact on performance):

static readonly Regex nonUnicodeRx = new Regex("\\P{L}");

public static string RemoveNonUnicodeLetters(string input) {
    return nonUnicodeRx.Replace(input, "");
}
Konrad Rudolph
  • 530,221
  • 131
  • 937
  • 1,214
  • 3
    One or two of my colleagues *may* understand `"\\P{L}"`. I've got about 35 colleagues. I must admit, the company does not hire top people, but you may understand what this would do regarding maintainability. I use regular expressions a lot (especially *during* development), but not when they can be easily avoided in important code... – Maarten Bodewes Jul 03 '13 at 21:09
  • @owlstead So put a comment next to it. This is not a valid reason to use the right tool for the job. Instead, you *learn* the tool – or, in your case, educate the coworkers. Yes, the regex is cryptic to the uninitiated but so is the conditional operator, yet there is an overwhelming consensus that you *should* use such idioms. I’m not even sure a comment is the way to go here – the code is entirely self-explanatory … given that a proper documentation of regular expressions exists. – Konrad Rudolph Jul 03 '13 at 21:11
2

To help Konrad and keyboardP resolve their differences, I ran a benchmark test, using their code. It turns out that keyboardP's code is 10x faster than Konrad's code

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace ConsoleApplication1
    {
        class Program
        {
            static void Main(string[] args)
            {
                string input = "asdf234!@#*advfk234098awfdasdfq9823fna943";
                DateTime start = DateTime.Now;
                for (int i = 0; i < 100000; i++)
                {
                    RemoveNonUnicodeLetters(input);
                }
                Console.WriteLine(DateTime.Now.Subtract(start).TotalSeconds);
                start = DateTime.Now;
                for (int i = 0; i < 100000; i++)
                {
                    RemoveNonUnicodeLetters2(input);
                }
                Console.WriteLine(DateTime.Now.Subtract(start).TotalSeconds);
            }
            public static string RemoveNonUnicodeLetters(string input)
            {
                StringBuilder sb = new StringBuilder();
                foreach (char c in input)
                {
                    if (Char.IsLetter(c))
                        sb.Append(c);
                }

                return sb.ToString();
            }
            public static string RemoveNonUnicodeLetters2(string input)
            {
                var result = Regex.Replace(input, "\\P{L}", "");
                return result;
            }
        }
    }

I got

0.12
1.2

as output

UPDATE:

To see if it is the Regex compilation that is slowing down the Regex method, I put the regex in a static variable that is only constructed once.

            static Regex rex = new Regex("\\P{L}");
            public static string RemoveNonUnicodeLetters2(string input)
            {
                var result = rex.Replace(input,m => "");
                return result;
            }

But this had no effect on the runtime.

Mark Lakata
  • 19,989
  • 5
  • 106
  • 123
  • Thanks for the benchmark. I just added mine with fewer loops and a longer string input and the results correlate. – keyboardP Jul 03 '13 at 21:30
  • 1
    Just for fun, if you benchmark a `char[]` instead of a `StringBuilder`, you get even better results by about 10%. (build a temp array of the string's size, loop through the string, populate the temp array with matches, then copy the temp array to a new array of the right size and pass it into the `string` constructor). – Joe Enos Jul 03 '13 at 21:34
  • For more complex regular expressions, extracting the creation actually has a substantial effect on the runtime. I’m still very disappointed at .NET’s regex implementation. The regex should be blazingly fast. FWIW, your benchmark code isn’t very reliable, you should be using `StopWatch`, you should probably use more iterations (although it may just be fine in this case) and you should interlace the test calls to mitigate periodic effects of background processes which may bias the result. Ideally you’d also plot the distributions to ensure that no outliers skew the means. – Konrad Rudolph Jul 03 '13 at 22:08
  • `StopWatch` is only more accurate if you are looking at timings faster than 15 ms. `DateTime.Now` is good to 15 ms. I ran 100x longer than that. 100K iterations is plenty. I just bumped it up until I got about 1 second of run time -- that's billions of cycles of CPU. I ran the code twice with the same results and then posted the results. – Mark Lakata Jul 03 '13 at 23:46
  • 2
    @Joe Enos, If you preallocated the StringBuilder with the size of the string (new StringBuilder(string.Length)), you would probably get a similar 10% speedup since the StringBuilder is allocating 16 bytes, then 32 then 64 as it is running. It's a great speedup for StringBuilder that I almost never see online. – PRMan Aug 20 '15 at 18:00