0

I'm working on a book encryption program for one of my courses and I've run into a problem. Our professor gave us the example of using say Pride and Prejudice as the book used to encrypt, so I chose that one to test my program. The current function I'm using to remove the punctuation from the string is taking so long that the program is being forced into break mode. This function works for smaller strings even pages long, but when I fed it Pride and Prejudice it takes way to long.

public void removePunctuation(ref string s) {
    string result = "";
    for (int i = 0; i < s.Length; i++) {
        if (Char.IsWhiteSpace(s[i])) {
            result += ' ';
        } else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
            // do nothing
        } else {
            result += s[i];
        }
    }
    s = result;
}

So I think I need a faster way to remove punctuation from this string if anyone has any suggestions? I know looping through every character is horrible, but I'm stumped and I was never taught Regex in depth.

Edit: I was asked how I was storing the string in the dictionary class! This is the constructor for another class that actually uses the formatted string.

        public CodeBook(string book)
    {
        BookMap = new Dictionary<string, List<int>>();
        Key = book.Split(null).ToList(); // split string into words
        foreach(string s in Key)
        {
            if (!BookMap.Keys.Contains(s))
            {
                BookMap.Add(s, Enumerable.Range(0, Key.Count).Where(i => Key[i] == s).ToList());
                // add word and add list of occurrances of word
            }
        }
    }
D. Christ
  • 17
  • 3
  • 3
    String is immutable. You have to use a StringBuilder to avoid excessive memory usage. – Hans Passant Feb 11 '18 at 17:58
  • It's not going to do much about your performance, but why not get rid of the empty if bu using `if (Char.IsLetter(s[i]) || Char.IsNumber(s[i])){result.Append(s[i]);} `? – oerkelens Feb 11 '18 at 18:12
  • Possible duplicate of [Remove punctuation from string with Regex](https://stackoverflow.com/questions/5871793/remove-punctuation-from-string-with-regex) – Camilo Terevinto Feb 11 '18 at 18:18
  • @oerkelens That is something I did not think about thank you. My professor hates unnecessary code. – D. Christ Feb 11 '18 at 18:31
  • that second part is a question on its own. `String.Split` takes an array of char, an argument list of char, or an array of string. Passing `null` is not a valid option. You are converting all white space to spaces before, so probably splitting on spaces is what you want. – Cee McSharpface Feb 12 '18 at 08:29
  • @dlatikay passing null to `String.Split` is a valid option and it splits on spaces. The second portion of code works, I'm just running into the same problem I was having in the beginning where it is taking too long to execute. It's taking every string in `CodeBook.Key` and adding it to a dictionary with a Value that is a `List` of the occurrences of that string. It's going through about 121,000 words, but since it only adds them once the dictionary ends up being about 7000 in count. – D. Christ Feb 12 '18 at 23:00

3 Answers3

3

This is slow because you construct string by concatenations in a loop. You have several approaches that are more performant:

  • Use StringBuilder - unlike string concatenation which constructs a new object each time you add a character, this approach expands the string under construction by larger chunks, preventing excessive garbage creation.
  • Use LINQ's filtering with Where - this approach constructs an array of chars in a single shot, then constructs a single string from it.
  • Use regular expression's Replace - this method is optimized to deal with strings of virtually unlimited sizes.
  • Roll your own algorithm - create an array of chars that corresponds to the length of the original string. Walk through the string, and add the characters that you wish to keep to the array. Use string's constructor that takes the array, the initial index, and the length to construct the string at once.
Sergey Kalinichenko
  • 714,442
  • 84
  • 1,110
  • 1,523
1

Looping through every character once is not that bad. You're doing it all in one pass, that's not trivial to avoid.

The problem lies in the fact that the framework will need to allocate a new copy of the (partial) string whenever you do something like

result += s[i];

You can avoid that by introducing a StringBuilder documented here to append non-punctuation characters as you go.

public string removePunctuation(string s) 
{
    var result = new StringBuilder();
    for (int i = 0; i < s.Length; i++) {
        if (Char.IsWhiteSpace(s[i])) {
            result.Append(" ");
        } else if (!Char.IsLetter(s[i]) && !Char.IsNumber(s[i])) {
            // do nothing
        } else {
            result.Append(s[i]);
        }
    }
    return result.ToString();
}

You could further reduce the number of necessary Append calls with a refined algorithm, for example look ahead to the next punctuation and append larger portions at once, or use an existing string manipulation library like RegEx. But the introduction of StringBuilder above should give you a noticable performance gain already.

I was never taught Regex in depth

Use the search provider of your choice, you may end up with a tested solution which you can just study and use: https://stackoverflow.com/a/5871826/1132334

Cee McSharpface
  • 8,493
  • 3
  • 36
  • 77
0

You can use Regex to remove punctuations as below.

public string removePunctuation(string s) 
{
    string result = Regex.Replace(s, @"[^\w\s]", "");
    return result;
}

^ Means: not these characters (letters, numbers).
\w Means: word characters.
\s Means: space characters.

Camilo Terevinto
  • 31,141
  • 6
  • 88
  • 120
Sampath
  • 1,173
  • 18
  • 29