A bit of manual parsing is probably the fastest way to solve this. Regex would be possible too, since this is really just a simple case of extracting a link rather than parsing a whole HTML document, but it could easily choke on those large files, performance-wise.
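If you do want to benchmark a regex against it, a rough sketch would look something like this (untested, assuming the same fixed anchor text the manual parser below searches for; ExtractLinksRegex is just a name I made up):

using System.Collections.Generic;
using System.Text.RegularExpressions;

// Matches <a ... href="...">Fixed_String</a>. Compiled because it has
// to run over large inputs; still naive about malformed markup.
private static readonly Regex linkRegex = new Regex(
    @"<a\s[^>]*href\s*=\s*[""']([^""']+)[""'][^>]*>Fixed_String</a>",
    RegexOptions.IgnoreCase | RegexOptions.Compiled);

private List<string> ExtractLinksRegex(string html)
{
    var links = new List<string>();
    foreach (Match m in linkRegex.Matches(html))
        links.Add(m.Groups[1].Value);
    return links;
}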
As for the manual parsing: let me preface this by saying that I haven't tested it at all and I feel kind of dirty posting it (I'm sure it needs some more edge-case checks to avoid errors), but here you go:
using System;
using System.Collections.Generic;

// Note: this can't be "const" -- C# only allows const for compile-time
// constants, so an array has to be static readonly instead.
private static readonly char[] quotes = new char[] { '"', '\'' };

private List<string> ExtractLinks(string html)
{
    var links = new List<string>();
    string searchFor = ">Fixed_String</a>";

    // Find each occurrence of the fixed anchor text, then backtrack
    // from it to the opening <a> tag to pull out the href value.
    for (int i = html.IndexOf(searchFor); i >= 0; i = html.IndexOf(searchFor, i + searchFor.Length))
    {
        string href = ExtractHref(html, i);
        if (!String.IsNullOrEmpty(href))
            links.Add(href);
    }
    return links;
}

private string ExtractHref(string html, int backtrackFrom)
{
    int hrefStart = -1;

    // Find "<a", but limit the search so we don't backtrack forever.
    // (Still naive: this would also match "<abbr", for example.)
    for (int i = backtrackFrom; i > backtrackFrom - 255; i--)
    {
        if (i < 0)
            return null;
        if (html[i] == '<' && html[i + 1] == 'a')
        {
            hrefStart = html.IndexOf("href=", i);
            break;
        }
    }

    // No href at all, or the first one found belongs to a tag that
    // comes after the anchor text we backtracked from.
    if (hrefStart < 0 || hrefStart > backtrackFrom)
        return null;

    // The value sits between the first pair of quotes after "href=".
    int start = html.IndexOfAny(quotes, hrefStart);
    if (start < 0)
        return null;
    int end = html.IndexOfAny(quotes, start + 1);
    if (end < 0)
        return null;
    return html.Substring(start + 1, end - start - 1);
}
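Calling it is then just a matter of reading the file into a string first. A minimal usage sketch (also untested; "page.html" is just a placeholder name):

using System;
using System.IO;

string html = File.ReadAllText("page.html");
foreach (string link in ExtractLinks(html))
    Console.WriteLine(link);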
XmlReader is probably a no-go, because you most likely cannot guarantee that those files are well-formed XHTML. If you want to do proper parsing, the HTML Agility Pack is probably your best choice, or maybe a properly done regex if it can't be helped. I posted the manual parsing so you have another alternative to run a performance test against.
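For completeness, here's roughly what the Agility Pack version would look like (again untested, and ExtractLinksAgility is just a name I made up):

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

private List<string> ExtractLinksAgility(string html)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    // SelectNodes returns null (not an empty collection) when the
    // XPath expression matches nothing.
    var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
    if (anchors == null)
        return new List<string>();

    return anchors
        .Where(a => a.InnerText == "Fixed_String")
        .Select(a => a.GetAttributeValue("href", ""))
        .ToList();
}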