1

Is there a way to remove every special character from a string like:

"\r\n               1802 S St Nw<br>\r\n                    Washington, DC 20009"

And to just write it like:

"1802 S St Nw, Washington, DC 20009"
abatishchev
Umair A.
  • 2
    Is this a ["How do I parse HTML with a regex"](http://stackoverflow.com/questions/1732348) question? – dtb Sep 27 '10 at 15:12

6 Answers

5

To remove special characters:

public static string ClearSpecialChars(this string input)
{
    // Strings to strip; add any other unwanted strings to this array.
    foreach (var ch in new[] { "\r", "\n", "<br>" })
    {
        input = input.Replace(ch, String.Empty);
    }
    return input;
}

To replace all double spaces with a single space:

public static string ClearDoubleSpaces(this string input)
{
    while (input.Contains("  ")) // double
    {
        input = input.Replace("  ", " "); // with single
    }
    return input;
}

You can also combine both methods into a single one:

public static string Clear(this string input)
{
    return input
        .ClearSpecialChars()
        .ClearDoubleSpaces()
        .Trim();
}
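
One thing to watch: the chain above deletes `<br>` outright, so the two address lines end up joined by a space rather than the comma the question asks for. A minimal sketch of a variant, assuming every line break is written exactly as `<br>` and should become a comma (the class and method names `AddressExtensions` and `ToSingleLine` are made up for illustration and are not part of the answer):

```csharp
public static class AddressExtensions
{
    // Hypothetical helper: same Replace-based idea, but "<br>" becomes a comma.
    public static string ToSingleLine(this string input)
    {
        string result = input
            .Replace("<br>", ",")   // assumption: the tag marks a line boundary
            .Replace("\r", " ")
            .Replace("\n", " ");

        // Collapse runs of spaces down to a single space.
        while (result.Contains("  "))
        {
            result = result.Replace("  ", " ");
        }

        return result.Trim();
    }
}
```

Called on the string from the question, this should give "1802 S St Nw, Washington, DC 20009". Other spellings of the tag would still need to be listed by hand.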
abatishchev
  • The problem is that this won't handle whitespace well. It can remove whitespace, but a single space should remain between words. – Umair A. Sep 27 '10 at 14:58
  • What about the other spellings of the tag, e.g. `<br/>`, `<br />`, `<BR>` etc.?
    – dtb Sep 27 '10 at 15:12
  • 1
    @dtb: Enumerate all items to replace. Or use RegEx. But what about the advice not to parse (X)HTML with RegEx? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – abatishchev Sep 27 '10 at 15:15
  • @Henk: Agree. So we have a dilemma - use looping or use RegEx. I prefer the first, the reason is under the link below :) – abatishchev Sep 27 '10 at 15:21
  • 1
    It's not pretty, but it'll work and it'll be maintainable. I would not worry about the performance so long as this is only used to clean up what users enter, not to scrub large amounts of existing data. – Steven Sudit Sep 27 '10 at 17:14
  • @Steven: I would consider maintainability a reason _not_ to use this. All the Replace variations are declarative. – H H Sep 27 '10 at 17:19
  • 1
    @Henk: Indeed they are, which makes them clearer to most people than RegExp. – Steven Sudit Sep 27 '10 at 17:26
  • 1
    @Henk: Look at the first code block. It explicitly lists each forbidden string, in an array. This is at least as declarative as RegExp code that specifies these strings, and it's a one-to-one declaration with no encoding. This makes it clearer than RegExp, and that's ultimately what matters here. I know you think of RegExp as declarative in that you don't tell it how to do its job, just what to do, but that's not the only possible meaning. And it's not what matters: what matters is that RegExp just isn't clear. – Steven Sudit Sep 27 '10 at 19:09
  • @Steven: That first block has a lot of trouble expressing "any occurrence of 2 or more spaces". Or "Only space after \r\n" or ... And again: It is horribly inefficient. – H H Sep 27 '10 at 19:42
  • @Henk: While I certainly can't claim that abatishchev's code is fully optimized, I'm not sure I follow your example. The first block doesn't even tackle the issue of double spaces; the second one does. If we wanted it to be faster, we could do a single-pass copy in a StringBuilder. – Steven Sudit Sep 27 '10 at 21:45
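
For reference, the single-pass StringBuilder copy mentioned in the last comment might look roughly like this. It is only a sketch under assumptions: `WhitespaceCollapser` and `CollapseWhitespace` are hypothetical names, it treats `\r` and `\n` like any other whitespace, and it does nothing about `<br>`.

```csharp
using System.Text;

public static class WhitespaceCollapser
{
    // Hypothetical helper: one pass over the input, no repeated Replace calls.
    public static string CollapseWhitespace(string input)
    {
        var sb = new StringBuilder(input.Length);
        bool pendingSpace = false;

        foreach (char c in input)
        {
            if (char.IsWhiteSpace(c))
            {
                // Remember the gap, but only once something has been written,
                // so leading whitespace is dropped.
                pendingSpace = sb.Length > 0;
            }
            else
            {
                if (pendingSpace)
                {
                    sb.Append(' ');
                    pendingSpace = false;
                }
                sb.Append(c);
            }
        }

        // A pending space at the end is never written, so trailing whitespace is dropped too.
        return sb.ToString();
    }
}
```

Each character is visited once, so the cost stays linear in the input length, whereas the `while`/`Replace` loop rescans the string on every iteration.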
1

There are two ways: you can use RegEx, or you can use String.Replace(...).

Muad'Dib
0

Use the Regex.Replace() method, specifying all of the characters you want to remove as the pattern to match.
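
As a rough illustration of that idea (not the answerer's own code), here is a sketch assuming each `<br>` variant should become a comma and every other whitespace run a single space. `AddressCleaner` and `CleanAddress` are hypothetical names; `Regex.Replace` and `RegexOptions.IgnoreCase` are the standard .NET APIs.

```csharp
using System.Text.RegularExpressions;

public static class AddressCleaner
{
    // Hypothetical helper, written for the input shape shown in the question.
    public static string CleanAddress(string input)
    {
        // Turn <br>, <br/> or <br /> (any case), plus the whitespace around it,
        // into a comma separator.
        string withCommas = Regex.Replace(
            input, @"\s*<br\s*/?>\s*", ", ", RegexOptions.IgnoreCase);

        // Collapse every remaining whitespace run (spaces, \r, \n, tabs) into one space.
        string collapsed = Regex.Replace(withCommas, @"\s+", " ");

        return collapsed.Trim();
    }
}
```

With the question's string this should yield "1802 S St Nw, Washington, DC 20009". It only targets `<br>` tags; for arbitrary HTML a real parser is the safer choice, as the comments above point out.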

Bernard
  • Actually there is a little more structure than "all of the characters you want to remove" – H H Sep 27 '10 at 15:29
0

You can use the C# Trim() method; see:

http://msdn.microsoft.com/de-de/library/d4tt83f9%28VS.80%29.aspx
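
Note that `Trim()` with no arguments only strips leading and trailing whitespace, so on its own it leaves the `<br>` and the inner line break in place. A tiny sketch to show this (`TrimDemo` and `TrimOnly` are made-up names):

```csharp
public static class TrimDemo
{
    // Hypothetical wrapper, only here to show what Trim() does and does not remove.
    public static string TrimOnly(string input)
    {
        // Removes whitespace (including \r and \n) at both ends; the middle is untouched.
        return input.Trim();
    }
}
```

So `Trim()` is useful as a final step, but it needs to be combined with one of the other approaches to handle the tag and the inner whitespace.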

elsni
0
System.Text.RegularExpressions.Regex.Replace("\"\\r\\n                                                            1802 S St Nw<br>\\r\\n                                                            Washington, DC 20009\"", 
 @"(<br>)*?\\r\\n\s+", "");
H H
Ruel
0

Maybe something like this, using ASCII int values. It assumes every HTML tag is closed.

using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    public static string Clean(this string str)
    {   
        string[] split = str.Split(' ');

        List<string> strings = new List<string>();
        foreach (string splitStr in split)
        { 
            if (splitStr.Length > 0)
            {
                StringBuilder sb = new StringBuilder();
                bool tagOpened = false;

                foreach (char c in splitStr)
                {
                    int iC = (int)c;
                    if (iC > 32) // skip control characters and the space (ASCII 0-32)
                    {
                        if (iC == 60) // '<' opens a tag
                            tagOpened = true;

                        if (!tagOpened)
                            sb.Append(c);

                        if (iC == 62) // '>' closes a tag
                            tagOpened = false;
                    }
                }

                string result = sb.ToString();   

                if (result.Length > 0)
                    strings.Add(result);
            }
        }

        return string.Join(" ", strings.ToArray());
    }
}
mdm20