A bit of manual parsing is probably the fastest way to solve this. Regex would be possible too, since this is really just a simple case of extracting a link rather than parsing a whole HTML document, but it could easily choke on those large files, performance-wise.
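If you do want to benchmark a regex against it, a rough sketch would look something like this (untested, assuming the same fixed anchor text the manual parser below searches for; ExtractLinksRegex is just a name I made up):

using System.Collections.Generic;
using System.Text.RegularExpressions;

// Matches <a ... href="...">Fixed_String</a>. Compiled because it has
// to run over large inputs; still naive about malformed markup.
private static readonly Regex linkRegex = new Regex(
    @"<a\s[^>]*href\s*=\s*[""']([^""']+)[""'][^>]*>Fixed_String</a>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

private List<string> ExtractLinksRegex(string html)
{
    var links = new List<string>();
    foreach (Match m in linkRegex.Matches(html))
        links.Add(m.Groups[1].Value);
    return links;
}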
As for the manual parsing: let me preface this by saying that I haven't tested it at all and I feel kind of dirty posting it (I'm sure it needs some more edge-case checks to avoid errors), but here you go:
using System;
using System.Collections.Generic;

// Note: this can't be "const" -- C# only allows const for compile-time
// constants, so an array has to be static readonly instead.
private static readonly char[] quotes = new char[] { '"', '\'' };

private List<string> ExtractLinks(string html)
{
    var links = new List<string>();
    string searchFor = ">Fixed_String</a>";

    // Find each occurrence of the fixed anchor text, then backtrack
    // from it to the opening <a> tag to pull out the href value.
    for (int i = html.IndexOf(searchFor); i >= 0; i = html.IndexOf(searchFor, i + searchFor.Length))
    {
        string href = ExtractHref(html, i);
        if (!String.IsNullOrEmpty(href))
            links.Add(href);
    }
    return links;
}

private string ExtractHref(string html, int backtrackFrom)
{
    int hrefStart = -1;

    // Find "<a", but limit the search so we don't backtrack forever.
    // (Still naive: this would also match "<abbr", for example.)
    for (int i = backtrackFrom; i > backtrackFrom - 255; i--)
    {
        if (i < 0)
            return null;
        if (html[i] == '<' && html[i + 1] == 'a')
        {
            hrefStart = html.IndexOf("href=", i);
            break;
        }
    }

    // No href at all, or the first one found belongs to a tag that
    // comes after the anchor text we backtracked from.
    if (hrefStart < 0 || hrefStart > backtrackFrom)
        return null;

    // The value sits between the first pair of quotes after "href=".
    int start = html.IndexOfAny(quotes, hrefStart);
    if (start < 0)
        return null;
    int end = html.IndexOfAny(quotes, start + 1);
    if (end < 0)
        return null;
    return html.Substring(start + 1, end - start - 1);
}
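Calling it is then just a matter of reading the file into a string first. A minimal usage sketch (also untested; "page.html" is just a placeholder name):

using System;
using System.IO;

string html = File.ReadAllText("page.html");
foreach (string link in ExtractLinks(html))
    Console.WriteLine(link);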
XmlReader is probably a no-go, because you most likely cannot guarantee that those files are well-formed XHTML. If you want to do proper parsing, the HTML Agility Pack is probably your best choice, or maybe a properly done regex if it can't be helped. I posted the manual parsing so you have another alternative to run a performance test against.
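For completeness, here's roughly what the Agility Pack version would look like (again untested, and ExtractLinksAgility is just a name I made up):

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

private List<string> ExtractLinksAgility(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null (not an empty collection) when the
    // XPath expression matches nothing.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        return new List<string>();

    return anchors
        .Where(a => a.InnerText == "Fixed_String")
        .Select(a => a.GetAttributeValue("href", ""))
        .ToList();
}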