75

For the hope-to-have-an-answer-in-30-seconds part of this question, I'm specifically looking for C#.

But in the general case, what's the best way to strip punctuation in any language?

I should add: Ideally, the solutions won't require you to enumerate all the possible punctuation marks.

Related: Strip Punctuation in Python

Community
Tom Ritter
  • Different languages are, in fact, different, and I don't think there's an answer to the question you're asking. You could ask about specific languages, or what language would be best for that sort of manipulation. – David Thornley Jun 17 '10 at 17:23

16 Answers

120
new string(myCharCollection.Where(c => !char.IsPunctuation(c)).ToArray());
Hath
GWLlosa
25

Why not simply:

string s = "sxrdct?fvzguh,bij.";
var sb = new StringBuilder();

foreach (char c in s)
{
   if (!char.IsPunctuation(c))
      sb.Append(c);
}

s = sb.ToString();

Regex is normally slower than simple char operations, and those LINQ operations look like overkill to me. You also can't use such code in .NET 2.0.
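For the "any language" part of the question, the same character-by-character filter can be sketched in Python. This is only a sketch under an assumption: it treats the Unicode general categories starting with "P" as the equivalent of what char.IsPunctuation tests, and the function name strip_punctuation is made up for illustration.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Return text without characters whose Unicode category starts with 'P'.

    Assumption: the P* general categories (Pc, Pd, Ps, Pe, Pi, Pf, Po)
    are what "punctuation" means here.
    """
    # Collect the kept characters and join once, instead of growing a
    # string one concatenation at a time.
    kept = [c for c in text if not unicodedata.category(c).startswith("P")]
    return "".join(kept)


print(strip_punctuation("sxrdct?fvzguh,bij."))  # sxrdctfvzguhbij
```

No list of punctuation marks is enumerated here; the classification comes from the Unicode character database.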

Hades32
  • Note that this approach also lets you replace punctuation with (for example) whitespace. Useful for tokenizing. –  Jan 16 '14 at 21:19
15

Describes intent, easiest to read (IMHO) and best performing:

 s = s.StripPunctuation();

to implement:

public static class StringExtension
{
    public static string StripPunctuation(this string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (!char.IsPunctuation(c))
                sb.Append(c);
        }
        return sb.ToString();
    }
}

This uses Hades32's algorithm, which was the best performing of those posted.

Brian Low
14

Assuming "best" means "simplest", I suggest using something like this:

String stripped = input.replaceAll("\\p{Punct}+", "");

This example is for Java, but all sufficiently modern Regex engines should support this (or something similar).

Edit: the Unicode-aware version would be this:

String stripped = input.replaceAll("\\p{P}+", "");

The first version only looks at punctuation characters contained in ASCII.
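The difference between the two versions can be illustrated with a rough Python sketch. Python's standard re module has no \p{...} classes, so this sketch assumes string.punctuation (32 ASCII characters) as a stand-in for \p{Punct} and the Unicode "P" general categories as a stand-in for \p{P}. The function names are hypothetical.

```python
import re
import string
import unicodedata

# Character class built from the 32 ASCII punctuation characters.
ASCII_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]+")


def strip_ascii_punct(text: str) -> str:
    """Remove only ASCII punctuation."""
    return ASCII_PUNCT.sub("", text)


def strip_unicode_punct(text: str) -> str:
    """Remove every character in a Unicode punctuation category (P*)."""
    return "".join(c for c in text if not unicodedata.category(c).startswith("P"))


sample = "\u00abHello\u00bb, world\u2026 it costs $5!"
print(strip_ascii_punct(sample))    # keeps the guillemets and the ellipsis, drops "$"
print(strip_unicode_punct(sample))  # drops guillemets and ellipsis, keeps "$"
```

Note that the two sets differ in both directions: characters such as "$" count as ASCII punctuation but fall under Unicode symbol categories rather than punctuation.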

Joachim Sauer
9

You can use the Regex.Replace method:

 Regex.Replace(yourString, regularExpressionWithPunctuationMarks, "")

Since this returns a string, your call will look something like this:

 string s = Regex.Replace("Hello!?!?!?!", "[?!]", "");

You can replace "[?!]" with something more sophisticated if you want:

(\p{P})

This should match any punctuation character.

Anton
6

This thread is so old, but I'd be remiss not to post a more elegant (IMO) solution.

string inputSansPunc = input.Where(c => !char.IsPunctuation(c)).Aggregate("", (current, c) => current + c);

It's LINQ sans WTF.

Nick Vaccaro
4

Based on GWLlosa's idea, I was able to come up with the supremely ugly, but working:

string s = "cat!";
s = s.ToCharArray().ToList<char>()
      .Where<char>(x => !char.IsPunctuation(x))
      .Aggregate<char, string>(string.Empty, new Func<string, char, string>(
             // Accumulator is named acc so it does not clash with the outer local s.
             delegate(string acc, char c) { return acc + c; }));
Tom Ritter
2

If you want to use this for tokenizing text you can use:

new string(myText.Select(c => char.IsPunctuation(c) ? ' ' : c).ToArray())
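As a rough illustration of why replacing (rather than deleting) helps with tokenizing, here is a Python sketch. It assumes the Unicode "P" categories as the definition of punctuation, and the name tokenize is hypothetical.

```python
import unicodedata


def tokenize(text: str) -> list:
    """Replace each punctuation character with a space, then split on whitespace."""
    spaced = "".join(
        " " if unicodedata.category(c).startswith("P") else c for c in text
    )
    # split() with no arguments collapses runs of whitespace.
    return spaced.split()


print(tokenize("Hello,world!How are you?"))  # ['Hello', 'world', 'How', 'are', 'you']
```

Deleting the comma instead would have merged "Hello" and "world" into a single token.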
Chris Marisic
2

The most brain-dead simple way of doing it would be using string.Replace.

The other way I can imagine is Regex.Replace, with a regular expression that contains all the appropriate punctuation marks.

TheTXI
2

Here's a slightly different approach using LINQ. I like AviewAnew's, but this avoids the Aggregate.

        string myStr = "Hello there..';,]';';., Get rid of Punction";

        var s = from ch in myStr
                where !Char.IsPunctuation(ch)
                select ch;

        var bytes = UnicodeEncoding.ASCII.GetBytes(s.ToArray());
        var stringResult = UnicodeEncoding.ASCII.GetString(bytes);
JoshBerke
  • Why the `IEnumerable` to array to bytes to string conversion, why not just `new String(s.ToArray())`? Or is that what new string will do under the hood anyway? – Chris Marisic Aug 24 '11 at 12:38
2

For anyone who would like to do this via RegEx:

This code shows the full Regex replace process and gives a sample Regex that keeps only the letters a-z (in either case), digits, and spaces in a string, replacing ALL other characters with an empty string:

// Regex that matches every run of characters other than letters a-z, digits, and spaces
System.Text.RegularExpressions.Regex TitleRegex =
    new System.Text.RegularExpressions.Regex(
        "[^a-z0-9 ]+",
        System.Text.RegularExpressions.RegexOptions.IgnoreCase);

string ParsedString = TitleRegex.Replace(stringToParse, String.Empty);

return ParsedString;
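A comparable whitelist can be sketched in Python to show a side effect of this approach. The sketch spells the case-insensitive option out as A-Za-z, and the function name keep_alphanumeric is made up for illustration.

```python
import re

# Anything that is not an ASCII letter, an ASCII digit, or a space.
NON_ALNUM = re.compile(r"[^A-Za-z0-9 ]+")


def keep_alphanumeric(text: str) -> str:
    """Delete every run of characters outside the whitelist."""
    return NON_ALNUM.sub("", text)


print(keep_alphanumeric("Hello, World! 123"))  # Hello World 123
print(keep_alphanumeric("Stra\u00dfe 12!"))    # Strae 12
```

The second call shows the trade-off: a whitelist of a-z removes more than punctuation, because non-ASCII letters are dropped as well.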
Azametzin
1

I faced the same issue and was concerned about the performance impact of calling IsPunctuation for every single character.

I found this post: http://www.dotnetperls.com/char-ispunctuation.

In short: char.IsPunctuation also handles Unicode on top of ASCII. The method matches a bunch of characters including control characters. By definition, this method is heavy and expensive.

The bottom line is that I ultimately didn't go for it because of its performance impact on my ETL process.

I went for the custom implementation from dotnetperls.

And just FYI, here is some code derived from the previous answers to get the list of all punctuation characters (excluding the control ones):

var punctuationCharacters = new List<char>();

// int loop variable: a char counter would overflow before the loop could end
for (int i = char.MinValue; i <= char.MaxValue; i++)
{
    var character = Convert.ToChar(i);

    if (char.IsPunctuation(character) && !char.IsControl(character))
    {
        punctuationCharacters.Add(character);
    }
}

// Joined with no separator, so this is one string of all punctuation characters
var allPunctuationCharacters = string.Join("", punctuationCharacters);

Console.WriteLine(allPunctuationCharacters);
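One way to combine the two ideas above (enumerate the punctuation characters once, then avoid a per-character classification call afterwards) is a precomputed lookup table. The Python sketch below assumes the Unicode "P" categories as the definition of punctuation, and the names PUNCT_TABLE and strip_punctuation_table are hypothetical. Whether it is actually faster for a given workload would need measuring.

```python
import sys
import unicodedata

# Build the table once: every code point whose category starts with "P"
# maps to None, which str.translate treats as "delete this character".
PUNCT_TABLE = dict.fromkeys(
    cp
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P")
)


def strip_punctuation_table(text: str) -> str:
    """Delete punctuation using the precomputed table."""
    return text.translate(PUNCT_TABLE)


print(len(PUNCT_TABLE) > 32)  # True: far more than the ASCII marks
print(strip_punctuation_table("Hello there..';,]';';., Get rid of it"))
```

The table costs one pass over all code points at startup, after which each call is a single translate.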

Cheers, Andrew

Andrew
0

For long strings I use this:

var normalized = input
                .Where(c => !char.IsPunctuation(c))
                .Aggregate(new StringBuilder(),
                           (current, next) => current.Append(next), sb => sb.ToString());

This performs much better than string concatenation (though I agree it's less intuitive).

Shay Ben-Sasson
0
// ereg_replace was removed in PHP 7.0; preg_replace accepts the same POSIX class
$newstr = preg_replace('/[[:punct:]]/', '', $oldstr);
0

This is simple code for removing punctuation from a string given by the user.

Import the required library:

    import string

Ask the user for a string, then remove each punctuation character in turn:

    strs = input('Enter your string:')

    for c in string.punctuation:
        strs = strs.replace(c, "")
    print(f"\n Your String without punctuation:{strs}")
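The loop above rescans the string once per punctuation character. A single-pass alternative can be sketched with str.translate. Like the loop, it only covers the ASCII characters in string.punctuation, and the function name is hypothetical.

```python
import string

# Translation table that deletes every character in string.punctuation.
_DELETE_ASCII_PUNCT = str.maketrans("", "", string.punctuation)


def remove_ascii_punctuation(text: str) -> str:
    """Remove ASCII punctuation in one pass over the string."""
    return text.translate(_DELETE_ASCII_PUNCT)


print(remove_ascii_punctuation("H,e.l/l!o W#o@r^l&d!!!"))  # Hello World
```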
-1
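#include <cctype>
#include <iostream>
#include <string>
using namespace std;

int main() {
    string strOne = "H,e.l/l!o W#o@r^l&d!!!";
    int punct_count = 0;

    cout << "before : " << strOne << endl;
    // Advance ix only when nothing was erased, so the character that
    // shifts into position ix is examined on the next iteration.
    for (string::size_type ix = 0; ix < strOne.size();) {
        // Cast to unsigned char: passing a negative char to ispunct is undefined.
        if (ispunct(static_cast<unsigned char>(strOne[ix]))) {
            ++punct_count;
            strOne.erase(ix, 1);
        } else {
            ++ix;
        }
    }
    cout << "after : " << strOne << endl;
    cout << "removed : " << punct_count << endl;
    return 0;
}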