
I'm looking for ideas and/or improvements for how to loop through a very large `List` in C#.

I have a `List<string>` named `excludes` that contains over 4 million strings.
I have a string named `message`, which is generated from user input.

I need to check whether the user input (`message`) contains any of the strings in the list (`excludes`) and, if so, remove the matched string(s) from `message`.

Example:

private static List<string> excludes = new List<string>(); // --> 4 million entries
string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input

foreach (string exclude in excludes)
{
    if (message.Contains(exclude))
    {
        message = message.Replace(exclude, "");
    }
}

On average this process takes about 350 ms to complete because of the size of the list.
Is there any way I could improve my code and reduce the time required to complete this task?

Persson
  • Another way could be using specialized storage like ElasticSearch: store your whole array in there, find matches with your current message, and then run your loop on a considerably smaller number of words. – Andriy Shevchenko Oct 21 '21 at 22:41
  • Is there any chance I could obtain your 4-million entries? Is it a publicly available list? – IOrlandoni Oct 21 '21 at 22:41
  • What are these entries? Stop-words or something similar? If so, you could store those entries in a dictionary (`HashSet`), split the message into an array of words and then search for each of them in that dictionary. – Dmitry Oct 21 '21 at 22:45
  • Use an overload of the [Contains](https://learn.microsoft.com/en-us/dotnet/api/system.string.contains?view=net-5.0#System_String_Contains_System_String_System_StringComparison_) method with the `StringComparison.Ordinal` parameter - this will work much faster. – Alexander Petrov Oct 21 '21 at 22:47
  • The fastest processing through a `List` is with a `for` (since we are saving milliseconds). – NetMage Oct 21 '21 at 22:52
  • You could try to do the `.Contains()` using parallel loops, each with say 1 million strings, have each parallel loop return a list of matches (0 or more each), then process those result lists sequentially. – Peter B Oct 21 '21 at 22:54
  • @IOrlandoni I suppose it is, in a way; however, it has been somewhat edited to only contain what I need. Here it is in JSON format: https://drive.google.com/file/d/1YlayuviqxhT9u9XaJaJR6h2Mo1NNqEN6/view?usp=sharing – Persson Oct 21 '21 at 22:56
  • Oh, and also see this: https://en.wikipedia.org/wiki/Scunthorpe_problem – Peter B Oct 21 '21 at 22:59
  • Are there any special characteristics of the list you could potentially take advantage of, or is it truly a `Contains` search? e.g. if the exclusion were all words, you could split the input into words (for some Regex definition of word) and use a `HashSet` for the exclusion and test each word of the input against the `HashSet` – NetMage Oct 21 '21 at 23:02
  • @AlexanderPetrov I'll try that and get back to you! – Persson Oct 21 '21 at 23:07
  • @PeterB Very interesting idea. I'll test this as well. As for the Scunthorpe problem, my "huge" list only contains emote codes for Twitch and YouTube and they *should* be safe for almost all languages hehe. – Persson Oct 21 '21 at 23:08
  • @Persson What do you mean "emote codes"? If you are talking about Unicode emoji codes, you can't use ordinal comparisons. – NetMage Oct 21 '21 at 23:16
  • Sorry, I made a mistake, the `Contains` method performs an ordinal comparison by default. So no additional parameter is needed. – Alexander Petrov Oct 21 '21 at 23:19
  • @NetMage No unicode emoji codes, those strings are plain text "random words" that are later converted into images / emotes on services like Twitch and YouTube. – Persson Oct 21 '21 at 23:23
  • It looks like Twitch/YouTube use `:` to delimit emote codes - do all the emote codes in the list start and end with `:`? – NetMage Oct 21 '21 at 23:47
  • @NetMage Sadly no, Twitch for instance allows users to create their own images/emotes with custom codes, no special characters, spaces or digits. Example: `LULWut` is an emote on Twitch. – Persson Oct 21 '21 at 23:59
  • Presumably you must have spaces or some punctuation, otherwise you couldn't write `LULWut27` without getting the emote? – NetMage Oct 22 '21 at 00:00
  • @NetMage You are correct, `LULWut27` would not show an emote, `LULWut 27` (with a space) would print the emote plus `27` but what I mean is that the custom emotes created by users can not contain any special characters or spaces, thus the strings in the list I got does not contain any delimiters or special characters. Maybe I misunderstood you ^^ – Persson Oct 22 '21 at 00:13
  • Your file seems to have some issues, e.g. ,`"cilienjuCitsd"troh2oh2HEE""` or `,"mrpo1odtte",ITS"queeniaztewing",y"kp01murp4",jeradjSkesup",` which don't look valid? Possibly even `,"$#@%@#^@$joshtest1HI",`? – NetMage Oct 22 '21 at 00:31
  • Another possible error `,"c","ytute","kt,"kor5eat","cshlenda",aydouPoug",oxxdisEnergio"g",`? Also, if `c` is supposed to be valid, that seems problematic. – NetMage Oct 22 '21 at 00:53

4 Answers


Disclaimer: I haven't looked at the word list; it didn't load on my cellphone. I imagine, from the comments, that it's like a CSV list of 4 million "words", the sort of "words" you might see if you had a million monkeys bashing on a million typewriters, probably because that's some corollary to how they were actually generated.

In the comments you seemed to indicate that the words in the input string are separated by spaces, otherwise they don't show up as emotes. As such, I'd load the 4 million exclusions into a HashSet, split the string into words, remove the words that are in the HashSet, then recombine the result:


private static HashSet<string> excludes = new HashSet<string>(); // --> 4 million entries, load it once

string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input

var bits = message.Split(' ');

for (int x = 0; x < bits.Length; x++)
{
    if (excludes.Contains(bits[x]))
    {
        bits[x] = null;
    }
}

var result = string.Join(" ", bits);

This just splits on space, so it knows it can recompose the result using space. If your input contains other characters (I can see you have an \r\n there, which would defeat this), then you probably want to look at splitting while keeping the delimiters, so you can take your split tokens, replace some of them, and then use string.Concat instead of Join; a sketch of that follows below. If you want to wring every last millisecond out of it, you probably need to look at shifting the solution to Spans, but this simple start should let you investigate whether it's an avenue worth pursuing.
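
Here is one possible sketch of the "split but keep the delimiters" idea, using Regex.Split with a capturing group and reusing the excludes HashSet from above. The pattern and variable names are my own illustration, not a definitive implementation:

using System;
using System.Text.RegularExpressions;

// Splitting on runs of non-word characters with a capturing group keeps the
// delimiters in the resulting array, so the original spacing and punctuation survive.
string[] tokens = Regex.Split(message, @"(\W+)");

for (int i = 0; i < tokens.Length; i++)
{
    // Only whole tokens are checked against the exclusion set, never substrings.
    if (excludes.Contains(tokens[i]))
    {
        tokens[i] = "";
    }
}

string result = string.Concat(tokens);

Because the delimiter runs are kept as their own tokens, string.Concat reproduces the original spacing and punctuation exactly; only the excluded words disappear.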

All in all, I think it's important to tokenize your input and then check entries in their entirety, rather than just perform naive repeated replacements. If your word list contains "hel", "lo", "orl" and "wd", then "hello world" will be reduced to just the separating space even though it contains none of those words, because e.g. replacing "orl" in "world" with nothing creates "wd", which never existed in the input. It's also important to note that if the replacements were performed in a different order you'd get a different result ("hello" would disappear, but "world" would become "wd" if the "orl" replacement was done last).
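
A quick demonstration of that cascade, using just the four fragments from the example above:

// None of these fragments is "hello" or "world", yet naive sequential
// replacement still wipes out both words.
string s = "hello world";
foreach (var fragment in new[] { "hel", "lo", "orl", "wd" })
{
    s = s.Replace(fragment, "");
}
Console.WriteLine($"'{s}'"); // prints ' ' (only the separating space remains)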

Worth noting that a HashSet is case-sensitive by default. I don't think you said whether your LULWuts are case-sensitive or not. If they're not, make the HashSet case-insensitive.
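
For example (a minimal sketch; the comparer is passed to the constructor):

using System;
using System.Collections.Generic;

// StringComparer.OrdinalIgnoreCase makes both Add and Contains ignore case
// without allocating lower-cased copies of the strings.
var excludes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
excludes.Add("LULWut");

Console.WriteLine(excludes.Contains("lulwut")); // True
Console.WriteLine(excludes.Contains("LULWUT")); // True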

Caius Jard

I'd look at the Bloom filter concept. It works in almost O(1) time and uses very little memory.

For example, there is a code sample in C# and a NuGet package.

But pay attention: it's a probabilistic data structure and does not always yield a correct result (it can report false positives, though never false negatives).
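
To illustrate the idea only, here is a deliberately small Bloom filter sketch. The class name, hash choices and sizing are my own assumptions and would need tuning for 4 million entries; the linked sample and NuGet package are the more realistic options:

using System;
using System.Collections;
using System.Collections.Generic;

// A tiny Bloom filter: k bit positions derived from two base hashes (double hashing).
// False positives are possible; false negatives are not.
class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int size, int hashCount)
    {
        _bits = new BitArray(size);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            _bits[index] = true;
    }

    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index])
                return false; // definitely not in the set
        return true;          // possibly in the set (could be a false positive)
    }

    private IEnumerable<int> Indexes(string item)
    {
        int h1 = StringComparer.Ordinal.GetHashCode(item);
        int h2 = StringComparer.OrdinalIgnoreCase.GetHashCode(item) | 1; // keep the step odd
        for (int i = 0; i < _hashCount; i++)
        {
            int combined = unchecked(h1 + i * h2);
            yield return (int)((uint)combined % (uint)_bits.Length);
        }
    }
}

The usage pattern would be to call MightContain for each word of the message first, and fall back to the exact HashSet (or list) only when it returns true.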

Andriy Shevchenko

Here are a few performance boosts, for which I'll give timings on my machine.

  • Just try the replacement, without checking whether the string is contained first:
foreach (var exclude in excludes)
{
    message = message.Replace(exclude, "");
}
  • You could parallelize it.
    This will hopefully speed it up by a factor of the number of cores you have, although a high number of replacements will slow it down due to locking.
var lockObject = new object();
Parallel.ForEach(excludes, exclude => {
    if (message.Contains(exclude))
    {
        lock(lockObject)
        {
            message = message.Replace(exclude, "");
        }
    }
});
  • You could parallelize it and just replace. A bit more complicated, as it requires a compare-and-swap retry loop in case of replacement.
Parallel.ForEach(excludes, exclude => {
    string message2;
    string message3;
    do
    {
        message2 = message;
        message3 = message2.Replace(exclude, "");
        if (object.ReferenceEquals(message2, message3))
            return; // nothing was replaced for this exclude
        // CompareExchange compares references; retry if another thread swapped message in the meantime
    } while (!object.ReferenceEquals(
        Interlocked.CompareExchange(ref message, message3, message2), message2));
});
| Method | Timing (sec) |
|---|---|
| ContainReplace (original) | 0.43 |
| JustReplace | 0.30 |
| ParallelContainReplace | 0.12 |
| ParallelJustReplace | 0.10 |
Charlieface

You could load the strings into a Trie. Looking up a word to see if it is an excluded word would take at most N comparisons where N is the length of the word.

Here are some resources:

https://www.geeksforgeeks.org/trie-insert-and-search/

https://en.wikipedia.org/wiki/Trie
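
For illustration, here is a minimal trie sketch restricted to lowercase ASCII letters. The class and member names are my own; a real implementation for emote codes would need mixed case and digits, e.g. a Dictionary<char, TrieNode> per node:

// A trie node with 26 children (lowercase ASCII only, for brevity).
class TrieNode
{
    public TrieNode[] Children = new TrieNode[26];
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string word)
    {
        TrieNode node = _root;
        foreach (char c in word) // assumes 'a'..'z' input
        {
            int i = c - 'a';
            if (node.Children[i] == null)
                node.Children[i] = new TrieNode();
            node = node.Children[i];
        }
        node.IsWord = true;
    }

    // At most word.Length comparisons, as described above.
    public bool Search(string word)
    {
        TrieNode node = _root;
        foreach (char c in word)
        {
            int i = c - 'a';
            if (node.Children[i] == null)
                return false;
            node = node.Children[i];
        }
        return node.IsWord;
    }
}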

harleybl