
Let's say we have two strings:

    str1 = "021A,775U,000A,021A,1U2,206B,240,249,255B,260,263B,280,294,2U1,306B,336B,345,427,440,442,474,477,4U4,500,508,523,543L,580,584,772,802,P55,VB" ;

    str2 = "772+240+802+P55+263B+1U2+VB";

and want to know whether every substring of str2, split by "+", is in str1.

Is there something "better" than doing this

    string[] str1Arr = str1.Split(',');
    foreach (var s in str2.Split('+'))
    {
        if (!str1Arr.Contains(s))
        {
             return false;
        } 
    }
    return true;

str2 would return true, while

    str3 = "772+240+802+P55+263B+1U2+V";

would return false.

Edit:

Max length of str1 is ~1000 with all delimited strings being unique and sorted. Max length of str2 is ~700 (average 200), with substrings being unique but not sorted.

Here are some more strings:

"200B+201B+202+202B+203B+204B+205+205B+206+206B+207+207B+208+208B+209+209B+210+210B+211B+214B+216B+460+494+498+830+379"//false
"612+637+638+646+649+655+664+678+688+693+694+754+756+758+774+778+780+781+787+99R+B58+RQW+RRG+RTJ+RVI+RVJ+RVU+RVV"//false
"255B+2U1+775U+336B+P55+802"//true
"220+223+224+229+230+233+234+402+537+550+811+863+872+881+889+965+512L+516L+520L+521L+524L+526L+528L+581L+810L+812L+814L+816L+818L+830"//false
"255B"//true
jigga
  • `Is there something "better"` - same, but with a `HashSet` instead of the `string[]`. – GSerg Oct 29 '22 at 18:12
  • It helps to provide more context around why you think your code is not fast enough. _Fastest_ is a very subjective concept around here that will attract opinionated responses until you constrain this to a very specific scenario. If you are calling this inside a loop then we can make it faster by computing arrays once outside of the loop. – Chris Schaller Oct 29 '22 at 19:00
  • The list that you are comparing against, `str1` is that fixed or does it also vary frequently? – Chris Schaller Oct 30 '22 at 08:21
  • For these strings it is fixed. But there are about 40 million unique str1. In general I'm doing a cross product of 40 mill unique str1 and 0.5 mill unique str2 – jigga Oct 30 '22 at 08:52

2 Answers


tldr;

Do not split str1 into an array, just use String.Contains() to find the match.

string str1Fixed = "," + str1 + ",";
foreach (var s in str2.Split('+'))
{
    if (!str1Fixed.Contains("," + s + ","))
         return false;
}
return true;

For a small number of elements, String.Contains() is going to be more efficient, especially in the case of positive matches close to the start of the string.

This was my very first thought before I provided my previous answer, but it was a gut feeling not supported by any empirical evidence. I also deleted that original answer because it didn't take exact matches into account (I didn't inject the commas into the search args).


Update: .Net 6 - String Interpolation

It turns out that using string interpolation can save a few memory allocations, so it will even further enhance this solution:

string str1Fixed = $",{str1},";
foreach (var s in str2.Split('+'))
{
    if (!str1Fixed.Contains($",{s},"))
         return false;
}
return true;

The difference is marginal, but if you are executing a significant number of iterations, it's all going to add up.


But don't take my word for it, let us prove it!

Firstly, I will introduce you to our test cases, all of which have been tested to ensure compliance:

public bool Arrays(string optionalValuesCSV, string requiredValuesPSV)
{
    string[] str1Arr = optionalValuesCSV.Split(',');
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Arr.Contains(s))
            return false;
    }
    return true;
}

public bool ConcatenatedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = "," + optionalValuesCSV + ",";
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains("," + s + ","))
             return false;
    }
    return true;
}

public bool FormattedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = String.Format(",{0},", optionalValuesCSV);
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains(String.Format(",{0},", s)))
             return false;
    }
    return true;
}

public bool InterpolatedString(string optionalValuesCSV, string requiredValuesPSV)
{
    string str1Fixed = $",{optionalValuesCSV},";
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        if (!str1Fixed.Contains($",{s},"))
             return false;
    }
    return true;
}

public bool LinqExcept(string optionalValuesCSV, string requiredValuesPSV)
{
    return !requiredValuesPSV.Split('+')
        .Except(optionalValuesCSV.Split(','))
        .Any();
}

public bool LinqHashsetExcept(string optionalValuesCSV, string requiredValuesPSV)
{
    return !requiredValuesPSV.Split('+')
        .Except(new HashSet<string>(optionalValuesCSV.Split(',')))
        .Any();
}
            
public bool RegexMatchIterator(string optionalValuesCSV, string requiredValuesPSV)
{
    Regex searchPattern = null;
    foreach (var s in requiredValuesPSV.Split('+'))
    {
        searchPattern = new Regex("," + s + ",");
        if (!searchPattern.IsMatch("," + optionalValuesCSV + ","))
             return false;
    }
    return true;
}       

public bool RegexMatchAllPattern(string optionalValuesCSV, string requiredValuesPSV)
{
    // note: need to escape \b for word boundary
    var pattern = "(?=.*?\\b" + requiredValuesPSV.Replace("+","\\b)(?=.*?\\b") + "\\b)";
    Regex searchPattern = new Regex(pattern);
    //return searchPattern.IsMatch("," + optionalValuesCSV + ","))
    return searchPattern.IsMatch(optionalValuesCSV);
}   

Most of this is from the thought process and comments to my previous answer:

  • Arrays
    The OP logic to test against

  • ConcatenatedString
    Use String.Contains() on the str1 value instead of splitting it into an array of strings.

  • FormattedString
    Same logic as ConcatenatedString but uses String.Format instead of concatenating values,

    • I know from previous experience and analysis that concatenation is more efficient for very small numbers of templated arguments, but it is included here as a proof so you don't have to take my word for it.
  • InterpolatedString
    Same logic as ConcatenatedString but uses String Interpolation instead of concatenating values,

    • I've never tested the performance of this; I hope it is faster than String.Format.
    • I can't test this on dotNetFiddle for .Net 4.7 because the compiler doesn't support it, and we can't test with .Net 6 or 7 because SQL Server CLR only supports .Net Framework, so this method is excluded from the current tests :(
  • LinqExcept
    This is the logic explained in my previous answer using arrays and LINQ Except()

  • LinqHashsetExcept
    Same as LinqExcept but uses a HashSet<string> instead of an array; the purpose of this is to remove duplicates. It isn't relevant any more as the OP has been updated to state the items in the arrays will already be unique, but it is an interesting case to leave in to see what overheads might be involved.

  • RegexMatchIterator
    Use Regex.IsMatch() instead of String.Contains; other than that, it's the same setup as the ConcatenatedString approach.

  • RegexMatchAllPattern
    RegEx matching by changing the required string into a regular expression; there are no array iterations in this solution.

You might find additional resources on RegEx helpful; many of the discussions out there talk through why and when you might choose one approach over the other.

Before we get started on some performance metrics, you will notice I have not included any tests relying on sorting the data. My expectation for this test set is that we will prove that splitting the strings into arrays and then iterating through the array to find the result takes longer than a simple String.Contains or String.IndexOf. There is an interesting discussion on the associated time complexities here.
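
For completeness, a sorted variant would look roughly like the sketch below (not benchmarked here, and only included to illustrate the approach being set aside; it sorts ordinally even though the question says the str1 elements already arrive sorted):

public bool SortedArrayBinarySearch(string optionalValuesCSV, string requiredValuesPSV)
{
    // Sort once (ordinal), then binary search each required value.
    string[] sorted = optionalValuesCSV.Split(',');
    Array.Sort(sorted, StringComparer.Ordinal);

    foreach (var s in requiredValuesPSV.Split('+'))
    {
        // BinarySearch returns a negative index when the value is not present.
        if (Array.BinarySearch(sorted, s, StringComparer.Ordinal) < 0)
            return false;
    }
    return true;
}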

NOTE: The following tests are executed on https://dotnetfiddle.net/ using the .NET 4.7.2 compiler. The actual numbers will be different in your operating environment, but it is probably a good comparison to how CLR UDFs are executed from within SQL Server.

  • It is a good comparison because the IO load is variable and shared across other instances; at any point in time in a live SQL database there will be a varying number of other concurrent queries that affect performance.
  • Ignore the specific values, but look at the patterns they form.

The first benchmark test is to run each of these methods once for the original values of str1 and str2, capturing the time it takes to execute a cold start; then, after executing 10 times, we sample the next execution:

| Method | Cold Start (ms) | Warm (ms) |
| --- | --- | --- |
| Arrays | 0.6484 | 0.0088 |
| ConcatenatedString | 0.0069 | 0.0024 |
| FormattedString | 0.1939 | 0.0029 |
| LinqExcept | 0.0104 | 0.0036 |
| LinqHashsetExcept | 0.2009 | 0.0052 |
| RegexMatchIterator | 0.0389 | 0.0216 |
| RegexMatchAllPattern | 0.0929 | 0.0879 |
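
(The timing harness itself lives in the linked fiddles rather than being reproduced here; conceptually it is something like the sketch below. The Benchmark method and the Func<string, string, bool> wrapper are illustrative only, not the actual fiddle code.)

static (double coldMs, double warmMs) Benchmark(Func<string, string, bool> method, string str1, string str2)
{
    var sw = System.Diagnostics.Stopwatch.StartNew();
    method(str1, str2);                              // cold start, includes any JIT cost
    sw.Stop();
    double coldMs = sw.Elapsed.TotalMilliseconds;

    for (int i = 0; i < 10; i++)                     // warm up with 10 executions
        method(str1, str2);

    sw.Restart();
    method(str1, str2);                              // sample the next (warm) execution
    sw.Stop();
    return (coldMs, sw.Elapsed.TotalMilliseconds);
}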

So far ConcatenatedString wins, but this isn't conclusive yet: I'm only running on dotnetfiddle, and each time I execute the test set the numbers come back different, though with a very similar distribution. We need to validate the test logic and identify baseline execution patterns before we start to vary the inputs. We do that by running the same evaluation for different numbers of iterations.

The following data is captured from this fiddle: https://dotnetfiddle.net/QRPRSu

| Mean Execution Time (ms) | 1 | 10 | 100 | 1000 | 10000 |
| --- | --- | --- | --- | --- | --- |
| Arrays | 0.0088 | 0.0036 | 0.003464 | 0.0034006 | 0.00300586 |
| ConcatenatedString | 0.0024 | 0.00256 | 0.002541 | 0.0024637 | 0.00230981 |
| FormattedString | 0.0029 | 0.00309 | 0.00308 | 0.0030005 | 0.00288773 |
| LinqExcept | 0.0036 | 0.00442 | 0.004649 | 0.0050969 | 0.00436625 |
| LinqHashsetExcept | 0.0052 | 0.00605 | 0.006088 | 0.0061603 | 0.00599868 |
| RegexMatchIterator | 0.0216 | 0.02257 | 0.058507 | 0.022021 | 0.02383395 |
| RegexMatchAllPattern | 0.0879 | 0.08973 | 0.087446 | 0.0904722 | 0.08860092 |

I didn't expect this, but for the 100-iteration cycle there is a spike (most visible in the RegexMatchIterator row); presumably some event occurs somewhere between 10 and 100 iterations that causes the execution time to spike, and testing in epochs less than or greater than 100 seems to smooth it out. Across all the iteration counts the values are still relative to each other and very similar, enough that we don't need to run a massive number of iterations with the same parameters to obtain consistent results.

Baseline Benchmarking results

On to the actual testing, this time using the new comparison strings:

    string str1 = "021A,775U,000A,021A,1U2,206B,240,249,255B,260,263B,280,294,2U1,306B,336B,345,427,440,442,474,477,4U4,500508,523,543L,580,584,772,802,P55,VB" ;
    string str2 = "772+240+802+P55+263B+1U2+VB";
    string str3 = "772+240+802+P55+263B+508+VB"; // false
    string str4 = "200B+201B+202+202B+203B+204B+205+205B+206+206B+207+207B+208+208B+209+209B+210+210B+211B+214B+216B+460+494+498+830+379";//false
    string str5 = "612+637+638+646+649+655+664+678+688+693+694+754+756+758+774+778+780+781+787+99R+B58+RQW+RRG+RTJ+RVI+RVJ+RVU+RVV";//false
    string str6 = "255B+2U1+775U+336B+P55+802";//true
    string str7 = "220+223+224+229+230+233+234+402+537+550+811+863+872+881+889+965+512L+516L+520L+521L+524L+526L+528L+581L+810L+812L+814L+816L+818L+830";//false
    string str8 = "255B";//true

Run this fiddle: https://dotnetfiddle.net/sFqleR

| Method | TestCases | Iterations | TotalTests | Overall | Milliseconds | ColdStart | Mean | StdDev | Variance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arrays | 7 | 1000 | 7000 | 00:00:00.0255972 | 25.5972 | 0.8965 | 0.00356150000000027 | 0.0495346716008814 | 0.049538210170847 |
| ConcatenatedString | 7 | 1000 | 7000 | 00:00:00.0152603 | 15.2603 | 0.1847 | 0.00208621428571429 | 0.00935358891914583 | 0.00935425710423217 |
| FormattedString | 7 | 1000 | 7000 | 00:00:00.0161606 | 16.1606 | 0.2099 | 0.00220965714285706 | 0.0040865872450029 | 0.00408687917537046 |
| LinqExcept | 7 | 1000 | 7000 | 00:00:00.0308686 | 30.8686 | 3.8193 | 0.00430554285714287 | 0.00879058428853723 | 0.00879121225469787 |
| LinqHashsetExcept | 7 | 1000 | 7000 | 00:00:00.0433723 | 43.3723 | 2.821 | 0.00609560000000002 | 0.00892103745152035 | 0.00892167473676254 |
| RegexMatchIterator | 7 | 1000 | 7000 | 00:00:00.0816964 | 81.6964 | 0.5562 | 0.0115590857142857 | 0.0318308417422667 | 0.0318331156174521 |
| RegexMatchAllPattern | 7 | 1000 | 7000 | 00:00:07.0529764 | 7052.9764 | 1.0156 | 1.00735934285714 | 1.07445298146738 | 1.07452973633274 |

ConcatenatedString still wins, but not by much over a String.Format approach.

  • It is interesting to note that the variance for String.Format is the lowest; that indicates it is the most consistent over the test cases.

Now let's try a full outer join between the optional and required values for a bit of variation. Then, for fun, let's reverse the strings and randomise the values to account for any bias in the sequence of the strings.
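
(The generator itself is in the fiddle; in essence it pairs every optional-values string with every required-values string, a LINQ cross join, and shuffles the elements. A rough sketch only, where optionalValueStrings and requiredValueStrings are hypothetical collections of the two kinds of input:)

var rng = new Random();
var testCases =
    from optional in optionalValueStrings     // hypothetical list of str1-style values
    from required in requiredValueStrings     // hypothetical list of str2-style values
    select (
        Optional: string.Join(",", optional.Split(',').OrderBy(_ => rng.Next())),
        Required: string.Join("+", required.Split('+').OrderBy(_ => rng.Next()))
    );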

Fiddle: https://dotnetfiddle.net/m6YQv9

| Method | TestCases | Iterations | TotalTests | Overall | Milliseconds | ColdStart | Mean | StdDev | Variance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arrays | 1024 | 10 | 10240 | 00:00:00.0188368 | 18.8368 | 0.2294 | 0.00173652343749998 | 0.0128300901366091 | 0.0128307166517418 |
| ConcatenatedString | 1024 | 10 | 10240 | 00:00:00.0157159 | 15.7159 | 0.2115 | 0.00143642578125 | 0.00719377438896546 | 0.00719412567320958 |
| FormattedString | 1024 | 10 | 10240 | 00:00:00.0164970 | 16.497 | 0.2815 | 0.00151858398437492 | 0.0038494732983341 | 0.00384966127466541 |
| LinqExcept | 1024 | 10 | 10240 | 00:00:00.0239897 | 23.9897 | 0.2688 | 0.00223395507812505 | 0.00772608535444904 | 0.00772646263234338 |
| LinqHashsetExcept | 1024 | 10 | 10240 | 00:00:00.0292562 | 29.2562 | 0.2896 | 0.00274775390625008 | 0.00598778336636879 | 0.0059880757600192 |
| RegexMatchIterator | 1024 | 10 | 10240 | 00:00:00.0800746 | 80.0746 | 0.4529 | 0.00772122070312506 | 0.0508215684930702 | 0.0508240501967361 |
| RegexMatchAllPattern | 1024 | 10 | 10240 | 00:00:03.6379600 | 3637.96 | 0.5773 | 0.355110078125 | 0.378996464474246 | 0.379014971516495 |

Comparison of the different testing methodologies

Overall ConcatenatedString still wins, but this time the margin is even smaller, and the FormattedString solution is still the most consistent, which might indicate that for a very large set of data, it could be the most reliable.

Also have a look at the original Arrays solution: it is only marginally more expensive than FormattedString but has a comparatively high degree of variance, which means this result set might have been lucky.

  • The last test only had 10 iterations of each combination due to runtime execution constraints, so it's not the most reliable data.

I had dreams of generating a random set of terms that fit the patterns described and then injecting a random number of the terms, but TBH I'm out of time. It was a fun adventure, but it's chewed up my Sunday and now it's getting late for dinner.

When looking for performance tuning, you need to have some good metrics around what you are currently experiencing and a realistic idea of both what is acceptable and how much time you are willing to pay for this effort. As an example, I bet the OP has already spent a number of hours on this task; add my 8 hours today and the collective hours of others who have read this post and considered it, and this solution has cost the team well over 12 hours.

I always defer back to this xkcd comic: Is It Worth the Time?

Is It Worth the Time?

Don't forget the time you spend finding the chart to look up what you save. And the time spent reading this reminder about the time spent. And the time trying to figure out if either of those actually make sense. Remember, every second counts toward your life total, including these right now.

The difference between the Arrays method and the ConcatenatedString is still very marginal; you would need a very large dataset to really observe any difference that this implementation can provide.

We've spent over 12 hours, so if this makes your query run 5 seconds faster, and you run it more than 5 times a day, then in about 5 years you break even (5 s × 5 runs/day × 365 days × 5 years ≈ 45,600 s, roughly 12.7 hours) against the cumulative time the users would have spent waiting for the query to complete if you had done nothing.

When working with SQL and CLR functions, there are other optimisations you can do in the SQL database that can significantly improve performance; CLR functions are hard for SQL to optimise. Generally, for a query that involves this sort of data, improvements to the rest of the query structure, or evaluating other filtering constraints before this type of lookup, would save more time than what we have saved here today.

A Final Thought...
If the str1 value is pretty consistent for all the items in the list, then an XML or JSON comparison inside the SQL can be far more efficient. You can reduce the cycles for each row by half by pre-formatting the str1 value as a constant variable for the query, but you lose any benefit if you try to pass that data across the boundary into CLR.

If this is frequently queried data used in search functions, similar to tags in tweets and other social media posts, then there are better architectural solutions to this problem, the simplest of which is to create a table of the individual terms so we can index them. The good news about SQL is that this implementation detail can be entirely managed in the database with triggers; that way the application and business domain logic do not need to be affected by this structural optimisation.

As you are using CLR, you should read over CLR vs. T-SQL: Performance considerations for some useful background. It's an older resource and SQL has evolved, but it's still relevant thinking.


Update: .Net 6

So after dinner, it's time to check out .Net 6. I had started to write up a great summary, because a couple of test runs showed that the original Arrays implementation was fastest for fixed values over 100K iterations, which would have meant all of this was potentially for nothing, but I can't replicate that result now :) Ultimately it means that this comparison is so fast with either of these options that you are better off looking elsewhere for optimisations if you need them.

Fiddle: https://dotnetfiddle.net/VoMeo9

Fixed input test in .Net 6

Test leader board 1

If we graph those numbers, excluding the RegEx results and the HashSet, you can see that over time the average execution time converges to a much smaller range, and the LinqExcept implementation even beats the formatted string implementation. This is interesting when you compare it to the later results, because it probably points to some special optimisations within .Net when we run the same calculation repeatedly.

String Interpolation is consistently close to the Concatenated String approach and even beats it twice in this series.

Chart of results in .Net 6

The final test is to re-visit the random results feed to see if Interpolation can get the win that I was hoping for:

Fiddle: https://dotnetfiddle.net/4N23s5

| Method | TestCases | Iterations | TotalTests | Overall | Milliseconds | ColdStart | Mean | StdDev | Variance |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arrays | 1024 | 80 | 81920 | 00:00:00.0856801 | 85.6801 | 0.2904 | 0.000990795898437515 | 0.00214697922130186 | 0.002146992325543057 |
| ConcatenatedString | 1024 | 80 | 81920 | 00:00:00.0637065 | 63.7065 | 0.1513 | 0.0007239416503907266 | 0.0014517292239806258 | 0.00145173808471375 |
| FormattedString | 1024 | 80 | 81920 | 00:00:00.0732739 | 73.2739 | 0.3548 | 0.0008382824707032207 | 0.0016765147809772755 | 0.0016765250137051203 |
| InterpolatedString | 1024 | 80 | 81920 | 00:00:00.0614540 | 61.454 | 0.2699 | 0.0006977844238282229 | 0.0017730743291170337 | 0.0017730851512029848 |
| LinqExcept | 1024 | 80 | 81920 | 00:00:00.1139557 | 113.9557 | 0.8458 | 0.0013408093261717685 | 0.007293846685190374 | 0.007293891203705162 |
| LinqHashsetExcept | 1024 | 80 | 81920 | 00:00:00.1413452 | 141.3452 | 0.349 | 0.0016796301269533514 | 0.001957306707138949 | 0.0019573186537003933 |
| RegexMatchIterator | 1024 | 80 | 81920 | 00:00:00.4237531 | 423.7531 | 4.8922 | 0.005117004394530726 | 0.018711766419816867 | 0.018711880628421194 |
| RegexMatchAllPattern | 1024 | 80 | 81920 | 00:00:08.2265704 | 8226.5704 | 1.1987 | 0.10035106689453285 | 0.10451169495402168 | 0.10451233284862492 |

It's the slimmest of margins... We'd call that a Bee's D!#@ where I'm from... Except that this time we are measuring, and these numbers are significantly smaller than that (it's ~4800 µm if you were wondering).

I had to try a few times to get the results I was looking for; with the test cases being randomised, it makes sense that we'd get different output on different executions. Here are the results from two observations:

Results of randomised testing in .Net 6

Everyone loves a chart, so here's the final one:

Mean Execution time for Random tests in .Net 6

Chris Schaller
  • just beautiful! I wished everyone here would answer the way you did! You helped me a lot, thank you so much! – jigga Oct 30 '22 at 11:05
  • @jigga I wanted to know about interpolation, seeing that you were worried about the concatenation all those hours ago ;) But also have a read of [CLR vs. T-SQL: Performance considerations](https://aboutsqlserver.com/2013/07/22/clr-vs-t-sql-performance-considerations/) for some useful background. It's an older resource and SQL has evolved, but it's still relevant thinking. – Chris Schaller Oct 30 '22 at 16:03

It depends on the length of the strings, but you could get a minor optimisation by using a HashSet to ensure that there are only unique entries in the first array:

HashSet<string> str1Values = new HashSet<string>(str1.Split(','));
foreach (var s in str2.Split('+'))
{
    if (!str1Values.Contains(s))
    {
         return false;
    } 
}
return true;

Or we can do the same in LINQ:

HashSet<string> str1Values = new HashSet<string>(str1.Split(','));
return str2.Split('+').All(s => str1Values.Contains(s));

We can also use LINQ Except() to do a similar comparison; with Except() we want no results, which would indicate that all the values matched.

return !str2.Split('+')
            .Except(new HashSet<string>(str1.Split(',')))
            .Any();

You could also package this into an extension method to reduce some of the ambiguity:

public static class StringExtensions
{
    /// <summary>Check if all the elements in this string exist in the comparison</summary>
    /// <param name="listOfAllValues">the CSV string that contains all the possible values</param>
    /// <param name="requiredElements">CSV of the values that all MUST exist in the source list</param>
    /// <param name="requiredElementsDelimiter">[Optional] delimiter to use for the requiredElements string</param>
    public static bool ContainsAll(this string listOfAllValues, string requiredElements, char requiredElementsDelimiter = '+')
    {
        return !requiredElements.Split(requiredElementsDelimiter)
            .Except(new HashSet<string>(listOfAllValues.Split(',')))
            .Any();
    }
}

Execute it like this:

Console.WriteLine(str1.ContainsAll(str2));
Console.WriteLine(str1.ContainsAll(str2, '+'));

See this fiddle: https://dotnetfiddle.net/45tU6z. It is semantically the same as your code and will break on the first non-matching value.


For small lists you could skip the array for the first string and just use Contains; it might not make a measurable difference though:

string str1Fixed = "," + str1 + ",";
foreach (var s in str2.Split('+'))
{
    if (!str1Fixed.Contains("," + s + ","))
    {
         return false;
    } 
}
return true;

Or in LINQ:

var str1Fixed = "," + str1 + ",";
return str2.Split('+').Select(s => "," + s + ",").All(s => str1Fixed.Contains(s));

NOTE: I had to add leading and trailing commas to ensure we get exact matches.

I wouldn't resort to regular expressions for simple exact matching like this, and I suspect we only need to split str1 into an array if there is going to be either a large number of comparisons (many elements in str2) or a very large number of elements in str1. At that point we could sort the first string and do some further optimisations, but you don't need to go that far until you can demonstrate performance issues.

The other consideration is object reuse. If this is in a comparison that is called many times inside another iterative block or loop, then there might be some additional optimisations, especially if one of the comparisons is relatively static.
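
For example, if the optional-values string is effectively fixed while many required-values strings are tested against it (as in the SQL CLR scenario discussed in the comments), the split and HashSet construction can be hoisted out of the loop and reused. A minimal sketch, where manyRequiredValueStrings is a hypothetical sequence of str2-style inputs:

// Build the lookup once for the relatively static str1, then reuse it for every str2.
HashSet<string> str1Values = new HashSet<string>(str1.Split(','));

foreach (var required in manyRequiredValueStrings)   // hypothetical outer loop
{
    bool allPresent = required.Split('+').All(str1Values.Contains);
    // ... use allPresent for this row ...
}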

For significantly large lists you could index the larger array by the first letter and sort lexicographically... But you really need to have a good justification for taking it this far.
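
Taken to that extreme, the index might be sketched like this (illustrative only, not benchmarked, and it assumes no empty elements in either string):

// Group the optional values by first character; keep each bucket sorted for binary search.
Dictionary<char, string[]> index = str1.Split(',')
    .GroupBy(s => s[0])
    .ToDictionary(g => g.Key, g => g.OrderBy(s => s, StringComparer.Ordinal).ToArray());

bool allPresent = str2.Split('+').All(s =>
    index.TryGetValue(s[0], out var bucket)
    && Array.BinarySearch(bucket, s, StringComparer.Ordinal) >= 0);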

Ideally by using the LINQ comparisons like Except we are assuming that someone smarter than us has applied practical optimisations for us so that we can concentrate on expressing our logic and worry less about the performance considerations.

Chris Schaller
  • Thank you so much for your answer! I'm looking for a faster way to get the same result I'm getting... Maybe a better algorithm... Do you think that running my code and looking for the substring directly in str1 (no splitting to an array) would speed things up? – jigga Oct 29 '22 at 19:03
  • That was my first thought to be honest, but _why_ comes to mind, knowing the external scenario will help. I'll show a variant that doesn't use an array. – Chris Schaller Oct 29 '22 at 19:04
  • Tbh it's part of an SQL Server CLR. There is a business case where I'm getting a kind of customer data (=str1) and str2 is our business logic. There are ~40 million unique str1 and about 0.5 million str2. Str2 are changing every ~3 months. Total duration is 44 hours. And the company wants to have a sqlclr for the computation... Don't ask me why – jigga Oct 29 '22 at 19:14
  • Oh MY GOSH, stop right there, SQL comparisons for something this simple are way faster than doing this in CLR. Unless your strings are _very_ large and _very_ different, go back and ask how to do this in SQL; it's much easier there, but also we optimise it in a different way because you are doing this in a set. – Chris Schaller Oct 29 '22 at 19:20
  • Your additional code: adding a ',' at the start and end makes two concatenations. Will it still be faster than using an array? Just theoretically – jigga Oct 29 '22 at 19:21
  • As mentioned earlier, I would love to do this in SQL, but I'm not allowed to do so... – jigga Oct 29 '22 at 19:25
  • Not allowed to? You have rights to add a CLR function but not a user defined function or to modify the query... If you were concerned about performance in SQL, then use a highly performant SQL solution first before going into CLR as a general rule. Please provide metrics though, like an example of the different value sets, not just 2 values, but to make it much faster will require you to restructure your query instead of trying to make a lame horse run. – Chris Schaller Oct 29 '22 at 19:29
  • 2 concatenations should be a lot faster than splitting into an array. The number of elements and duplicates will determine if using an array improves things. So try and let us know which one worked. If you provide a better dataset to test with I can do better analysis, but only you have the full dataset to analyse, so you're in the best position to do the testing. – Chris Schaller Oct 29 '22 at 19:32
  • Just edited the question. Hope that helps :) – jigga Oct 30 '22 at 07:56
  • Thanks again! With your help I went down to 31h! Just for the record: how would such a query look like doing the same without c#? – jigga Nov 05 '22 at 18:00
  • Ah if you're waiting 31h for a query, I would structure the data differently. Use triggers to split the strings into an M:N _tag_ table and have a table of master _tags_. It's a standard way to manage _hashtags_ in SQL. The query would then take minutes or less with good indexing, but I'll post a direct SQL example. What version are you using? – Chris Schaller Nov 05 '22 at 18:16
  • Sql server 2019 – jigga Nov 05 '22 at 18:30
  • You forgot about me (crying) xD – jigga Nov 07 '22 at 18:16
  • Sorry, TBH my test database is still generating the 40m rows. I'll stop it and run with what I've got, but I'm suspecting a structural approach will be better. – Chris Schaller Nov 07 '22 at 20:48