6

I need to remove words from a string based on a set of words:

Words I want to remove:

DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND

If I receive a string like:

EDIT: This string has already been "cleaned" of any symbols.

THIS IS AN AMAZING WEBSITE AND LAYOUT

The result should be:

THIS IS AMAZING WEBSITE LAYOUT

So far I have:

public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });

    string pattern = "";

    foreach (string word in splitWords)
    {
        pattern = @"\b" + word + "\b";
        stringToClean = Regex.Replace(stringToClean, pattern, "");
    }

    return stringToClean;
}

But it's not removing the words. Any idea why?

I don't know if I'm using the most efficient way to do it. Maybe I should put the words in an array, just to avoid splitting them every time?

Thanks

Patrick

7 Answers

9
private static List<string> wordsToRemove =
    "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ').ToList();

public static string StringWordsRemove(string stringToClean)
{
    return string.Join(" ", stringToClean.Split(' ').Except(wordsToRemove));
}

Modification to handle punctuation:

public static string StringWordsRemove(string stringToClean)
{
    // Define how to tokenize the input string, i.e. space only or punctuations also
    return string.Join(" ", stringToClean
        .Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
        .Except(wordsToRemove));
}
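One behaviour worth knowing about: `Enumerable.Except` is a set operation, so it also drops repeated words from the input (the second `NEW` in `NEW YORK AND NEW JERSEY` would disappear). If repeats should survive, a minimal sketch using `Where` with a `HashSet<string>` lookup could look like the following. The class name `WordFilter` and method name `RemoveWords` are hypothetical and not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilter
{
    // Hypothetical helper: same word list as above, stored in a set for fast lookups.
    private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
        "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' '));

    public static string RemoveWords(string stringToClean)
    {
        // Where keeps every surviving token, repeats included, in the original order.
        return string.Join(" ", stringToClean
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !WordsToRemove.Contains(word)));
    }
}
```

With this version, `WordFilter.RemoveWords("NEW YORK AND NEW JERSEY")` keeps both occurrences of `NEW`. The comparison is case-sensitive, as in the original answer.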
Fung
  • but, what if `stringToClean` has punctuation? – Jodrell Jul 16 '13 at 14:29
  • Hi, thanks for your help. I have chosen your answer for being the fastest, with a no-iteration solution. Regards. – Patrick Jul 16 '13 at 15:04
  • what about all the punctuation like `"`, `£`, `$`, `%`, `^`, `&`, `(`, `)`, `-`, `_`, `+`, `=`, `[`, `]`, `{`, `}`, `:`, `;`, `@`, `#`, `~` etc. etc. – Jodrell Jul 16 '13 at 15:14
  • @Jodrell, if you have a very limited set, you can plug them all into the modified version's `Split()` call, though the OP said he has already removed them from the input. For the sake of discussion, I'd suggest solving the problem in 2 steps: 1) preprocess the string to remove any punctuation, 2) tokenize and remove the unwanted words. For 1), you can check the answer [here](http://stackoverflow.com/questions/421616/how-can-i-strip-punctuation-from-a-string). – Fung Jul 16 '13 at 15:25
  • @Patrick, I did a performance test on my system. With your test data, this Linq method is about 4x faster than the Regex approach in my answer. +1 from me. Test code available if anybody is interested. I'd suspect there might be some variation as `stringToClean` grows, but that wasn't the question. – Jodrell Jul 16 '13 at 15:55
  • I had to add `.ToArray()` at the end of the call for it to work: `return string.Join(" ", stringToClean.Split(' ').Except(wordsToRemove).ToArray());` – wirble Oct 13 '15 at 16:34
1

I just changed this line

pattern = @"\b" + word + "\b";

to this

pattern = @"\b" + word + @"\b"; //added '@' 

and I got the result

THIS IS AMAZING WEBSITE LAYOUT

It would also be better to use String.Empty instead of "", like this:

stringToClean = Regex.Replace(stringToClean, pattern, String.Empty);
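The reason the `@` matters: in a regular C# string literal, `"\b"` is the backspace character (U+0008), so the regex engine never sees the word-boundary anchor. In a verbatim literal, `@"\b"` is a backslash followed by `b`, which the regex engine reads as a word boundary. A small sketch of the difference follows; the class name `BoundaryPatterns` and its method names are hypothetical, and the `Regex.Escape` call is an addition that the answer above does not use.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryPatterns
{
    // Reproduces the pattern from the question: the trailing "\b" is a backspace character.
    public static string Broken(string word)
    {
        return @"\b" + word + "\b";
    }

    // Verbatim literals on both sides, so the regex engine receives \b twice.
    // Regex.Escape guards against words that contain regex metacharacters.
    public static string Fixed(string word)
    {
        return @"\b" + Regex.Escape(word) + @"\b";
    }
}
```

For the word `AN`, `Broken` yields a 5-character pattern that can only match `AN` followed by a literal backspace, while `Fixed` yields the 6-character pattern `\bAN\b`.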
Shaharyar
1

I used LINQ

string exceptions = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND";
string[] exceptionsList = exceptions.Split(' ');

string test  ="THIS IS AN AMAZING WEBSITE AND LAYOUT";
string[] wordList = test.Split(' ');

string final = null;
var result = wordList.Except(exceptionsList).ToArray();
final = String.Join(" ",result);

Console.WriteLine(final);
Lotok
  • That's beautifully done! Just as explicit and accurate as functional programming should be! – Viktor Mellgren Jul 16 '13 at 14:20
  • however, if the `stringToClean` contains word boundaries that are not spaces, like `',', '.', '?', '"', ...` you are in a world of pain. Note, this set of word boundaries is large and growing. – Jodrell Jul 16 '13 at 14:25
  • more feedback then: Just do ``return String.Join(" ",result);`` – Viktor Mellgren Jul 16 '13 at 14:25
  • Hi, thanks for your help. I have chosen @Fung's answer for being the fastest, with a no-iteration solution. Regards. – Patrick Jul 16 '13 at 15:04
0

The output you get is "THIS IS AMAZING WEBSITE LAYOUT".

I was getting an issue whereby it was leaving the letter "D" in the result (so it was THIS IS AN AMAZING WEBSITE D LAYOUT), because Replace replaces only a certain part of the word. This removes the entire word if it is one of the words you defined (I imagine this is what you want?).

        string[] tabooWords = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ');
        string text = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
        string result = text;

        foreach (string word in text.Split(' '))
        {
            if (tabooWords.Contains(word.ToUpper()))
            {
                int start = result.IndexOf(word);
                result = result.Remove(start, word.Length);
            }
        }
Dr Schizo
  • won't this strip all the `A`s, `E`s and `O`s etc? – Jodrell Jul 16 '13 at 14:37
  • Hi, thanks for your help. I have chosen your answer for being the fastest, with a no-iteration solution that I can use with any WordsToRemoveString. Regards. – Patrick Jul 16 '13 at 15:26
0
public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });
    string pattern = " (" + string.Join("|", splitWords) + ") ";
    string cleaned=Regex.Replace(stringToClean, pattern, " ");
    return cleaned;
}
Anderung
0

How about:

// make a pattern to match all words 
var pattern = string.Format(
    @"\b({0})\b",
    string.Join("|", wordsToRemove.Split(new[] { ' ' })));

// pattern will be of the form "\b(badword1|badword2|...)\b"

// remove all the bad words from the string in one go.    
var cleanString = Regex.Replace(stringToClean, pattern, string.Empty);

// normalise the white space in the string (one space at a time)
var normalisedString = Regex.Replace(cleanString, @"\s+", " ");

The first line makes a pattern that matches any of the words to remove. The second line replaces them all at once which saves needless iteration. The third line normalises the white space in the string.
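The three steps above could be wrapped into one method, roughly as sketched below. This is an illustration under a few assumptions rather than the answer's exact code: the class name `RegexWordRemover` is hypothetical, `Regex.Escape` is added in case a word contains regex metacharacters, and the final `Trim()` is added to drop the space left behind when a removed word starts or ends the string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordRemover
{
    public static string StringWordsRemove(string stringToClean, string wordsToRemove)
    {
        // Build the alternation "word1|word2|...", escaping each word (an addition).
        var alternatives = wordsToRemove
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => Regex.Escape(word));
        var pattern = @"\b(" + string.Join("|", alternatives) + @")\b";

        // Remove all the unwanted words in one pass.
        var cleanString = Regex.Replace(stringToClean, pattern, string.Empty);

        // Collapse the whitespace left behind, then trim the ends (Trim is an addition).
        return Regex.Replace(cleanString, @"\s+", " ").Trim();
    }
}
```

If the word list is fixed, the pattern could be built once and cached in a static `Regex` instead of being rebuilt on every call.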

Jodrell
  • Functionality is important but so is readability. You should consider your formatting. Less isn't always more. – Lotok Jul 16 '13 at 14:47
  • @Jodrell Hi, thanks! But I'm getting blank spaces remaining between the words. Any ideas? Regards. – Patrick Jul 16 '13 at 14:51
  • @Patrick, that's because only the word is being replaced, not the spaces. Like in your example. – Jodrell Jul 16 '13 at 15:00
  • @Patrick, I've added a third line to normalise the whitespace. – Jodrell Jul 16 '13 at 15:09
  • Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a functional solution. Regards. – Patrick Jul 16 '13 at 15:29
0

Or...

stringToClean = Regex.Replace(stringToClean, @"\bDE\b|\bDA\b|\bDAS\b|\bDO\b|\bDOS\b|\bAN\b|\bNAS\b|\bNO\b|\bNOS\b|\bEM\b|\bE\b|\bA\b|\bAS\b|\bO\b|\bOS\b|\bAO\b|\bAOS\b|\bP\b|\bLDA\b|\bAND\b", String.Empty);
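// Collapse the runs of spaces left behind into a single space
// (replacing them with String.Empty would glue the neighbouring words together).
stringToClean = Regex.Replace(stringToClean, @"\s+", " ");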
James R.
  • erm, why not type `@"\b(DE|DA|DAS|DO|DOS|AN|NAS|NO|NOS|EM|E|A|AS|O|OS|AO|AOS|P|LDA|AND)\b"` – Jodrell Jul 16 '13 at 14:35
  • @Jodrell - Because, that would be too easy. :) Thanks. – James R. Jul 16 '13 at 14:52
  • Hi, thanks for your help. I have chosen Fung's answer for being the fastest, with a no-iteration solution that I can use with any WordsToRemoveString. Regards. – Patrick Jul 16 '13 at 15:25