I have a list of stopwords that need to be removed from a string.

List<string> stopwordsList = stopwords.getStopWordList();
string text = PDF.getText();
foreach (string stopword in stopwordsList)
{
   text = text.Replace(stopword, "");
}
PDF.setText(text);

In the debugger I can see that stopwordsList is populated correctly, but text.Replace() seems to have no effect whatsoever.

What am I doing wrong?

Edit: Note that I have also tried text.Replace() on its own, rather than text = text.Replace(). Neither works.

Guru Stron
John 'Mark' Smith
  • What does the getText function return? – Max Dec 10 '13 at 12:39
  • Cannot reproduce your issue. – ken2k Dec 10 '13 at 12:41
  • Have you debugged it and checked what stopword is in each iteration of the foreach loop? I'm pretty sure those are incorrect, because the code looks fine otherwise. – Tobberoth Dec 10 '13 at 12:42
  • Are you only checking the result of text from the PDF object's setText method? The replace looks like it should work. – Kevin Nacios Dec 10 '13 at 12:42
  • Is it possible that your stopwords list strings and your text string from PDF are different cases? Like "morning" and "Morning"? – Boluc Papuccuoglu Dec 10 '13 at 12:43
  • Could you provide sample values of text and stopwords? – Boluc Papuccuoglu Dec 10 '13 at 12:49
  • Could you try posting a sample of the actual data that you're working with? I think a few people (including myself) have tested your code and it appears to do what you expect. – Paul Michaels Dec 10 '13 at 12:49
  • This all looks fine to me unless `text` doesn't match the values contained in the list. I think you might need to see [C# Case Insenstive String Replace](http://www.codeproject.com/Articles/10890/Fastest-C-Case-Insenstive-String-Replace) – huMpty duMpty Dec 10 '13 at 12:52
  • What do your stop words look like? Imagine that you want to remove "a" (the indefinite article) from the text "A simple replace is as good as a mistake". If you just remove "a" with Replace you'll get "A simple replce is s good s  mistke", which is incorrect. It seems that you should use regular expressions here. – Dmitry Bychenko Dec 10 '13 at 12:53

2 Answers

I don't think there is anything wrong with your code, but I would do something like this:

string someText = "this is some text just some dummy text Just text";
List<string> stopwordsList = new List<string>() { "some", "just", "text" };    
someText = string.Join(" ", someText.Split().Where(w => !stopwordsList.Contains(w, StringComparer.InvariantCultureIgnoreCase)));

You can drop the StringComparer.InvariantCultureIgnoreCase argument if casing matters.
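For a large stop-word list, the same split-and-filter idea can be sketched with a HashSet<string> built with a case-insensitive comparer, so each lookup does not scan the whole list. This is only an illustration written as top-level statements (C# 9 or later); the sample words and the variable names `stopwords`, `input`, and `filtered` are assumptions, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data; in the question the list comes from stopwords.getStopWordList().
var stopwords = new HashSet<string>(
    new[] { "some", "just", "text" },
    StringComparer.OrdinalIgnoreCase); // the set itself ignores case on lookup

string input = "this is some text just some dummy text Just text";

// An empty separator array makes Split break on white space;
// RemoveEmptyEntries drops the empty tokens produced by repeated spaces.
string filtered = string.Join(" ",
    input.Split(Array.Empty<char>(), StringSplitOptions.RemoveEmptyEntries)
         .Where(word => !stopwords.Contains(word)));

Console.WriteLine(filtered); // this is dummy
```

Like the snippet above, this only works when the words in the text are separated by white space, and punctuation attached to a word ("text,") will prevent a match.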

Note I have also tried text.Replace() on its own, rather than text = text.Replace()

You should know that Replace does not modify the string it is called on; it returns a new string, and you have to assign that result if you want the updated value. So text = text.Replace(), which you are already using, is the correct form.
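A small sketch of both points, the discarded return value and the case sensitivity raised in the comments. The sample string and variable names are made up for illustration, and the case-insensitive variant uses Regex.Replace with RegexOptions.IgnoreCase as one possible approach, not necessarily the best one for your data.

```csharp
using System;
using System.Text.RegularExpressions;

string original = "Good Morning, good morning";

// The return value is discarded here, so 'original' is unchanged.
original.Replace("morning", "");
Console.WriteLine(original); // Good Morning, good morning

// Assigning the result keeps the change, but string.Replace is case-sensitive,
// so "Morning" with a capital M survives.
string caseSensitive = original.Replace("morning", "");
Console.WriteLine("[" + caseSensitive + "]"); // [Good Morning, good ]

// Regex.Escape guards against stop words that contain regex metacharacters.
string caseInsensitive = Regex.Replace(
    original, Regex.Escape("morning"), "", RegexOptions.IgnoreCase);
Console.WriteLine("[" + caseInsensitive + "]"); // [Good , good ]
```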

Ehsan
  • wondering the same. @huMptyduMpty – Ehsan Dec 10 '13 at 12:55
  • In the absence of any other input from the OP, I too am inclined to assume case sensitivity to be the culprit. By the way, I would use a HashSet instead of a List in case the stopwords list is significantly large. – Boluc Papuccuoglu Dec 10 '13 at 12:56
  • By the way, your code assumes that the PDF source is like: "Now is the winter of our discontent, made glorious summer by this sun of York", and not "NowisthewinterofourdiscontentmadeglorioussummerbythissunofYork" . – Boluc Papuccuoglu Dec 10 '13 at 13:17

There is one catch, though: none of the previous solutions takes word boundaries into account. For example, 'hell' might be a bad word, but 'hello' is perfectly valid. Replacement should be done only on whole words; otherwise you can get weird results.

Here is code that takes word boundaries into account:

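using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

var text = "Hello world, this is a great test!";
var badWords = new List<string>()
{
    "Hello",
    "great"
};

// Find every whole word, then process the matches from last to first so that
// removing one does not shift the indexes of the matches still to be handled.
var wordMatches = Regex.Matches(text, "\\w+")
    .Cast<Match>()
    .OrderByDescending(m => m.Index);

foreach (var m in wordMatches)
    if (badWords.Contains(m.Value)) // case-sensitive comparison
        text = text.Remove(m.Index, m.Length);

Debug.WriteLine(text); // " world, this is a  test!"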
Kaspars Ozols
  • Nice. Anyway, this works better ;) : text = text.Remove(m.Index, m.Length + 1); – Dragouf Jan 23 '14 at 10:56
  • Not always. That assumes there is a space after the word being removed, but it could just as well be a punctuation mark that completes the sentence (period, question mark, exclamation mark, etc.). It would be smarter to process the string as in the original answer and then remove the duplicate spaces. There are plenty of examples of how to do that. One is here: [link](http://stackoverflow.com/questions/206717/how-do-i-replace-multiple-spaces-with-a-single-space-in-c) – Kaspars Ozols Jan 23 '14 at 16:04
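
The suggestion in the last comment (remove whole words first, then collapse the leftover spaces) could be sketched roughly as follows. This variant swaps the index-based Remove for a single Regex.Replace with \b word boundaries, which is a different technique from the answer's; the sample sentence, the `words` array, and the other variable names are assumptions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string sentence = "Hello world, this is a great test! hello again";
string[] words = { "Hello", "great", "a" }; // hypothetical stop words

// Builds \b(?:Hello|great|a)\b so that only whole words match;
// Regex.Escape protects against stop words containing regex metacharacters.
string pattern = @"\b(?:" + string.Join("|", words.Select(w => Regex.Escape(w))) + @")\b";

// Step 1: remove the stop words, ignoring case. Punctuation is left untouched.
string removed = Regex.Replace(sentence, pattern, "", RegexOptions.IgnoreCase);

// Step 2: collapse runs of white space left behind and trim the ends.
string cleaned = Regex.Replace(removed, @"\s{2,}", " ").Trim();

Console.WriteLine(cleaned); // world, this is test! again
```

Note that the "a" inside "again" is kept, because \b only matches at the edge of a word.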