C# get a certain part of a string for multiple occurrences in a string

Question

So I am trying to grab member profile links from a forum and display them in a console app. What I want to do is grab all the links from the webpage and print them out.

Currently I am getting the page source like so:

String source = new WebClient().DownloadString("URL");

What I want to do is iterate through that string and find every string like this:

<h3 class='ipsType_subtitle'>
         <strong><a href='http://www.website.org/community/user/8416-unreal/' title='View Profile'>!Unreal</a></strong>
</h3>

Then, once I have that part, I want to extract the URL, like so:

http://www.website.org/community/user/8416-unreal/

This is the code I have tried so far. It works, but it only grabs one of the links:

    WebClient c = new WebClient();
    String members = c.DownloadString("http://www.powerbot.org/community/members/");
    int times = Regex.Matches(members, "<h3 class='ipsType_subtitle'>").Count;
    Console.WriteLine(times.ToString());

    for (int i = 1; i < times; i++)
    {
        try
        {
            int start = members.IndexOf("<h3 class='ipsType_subtitle'>");
            members = members.Substring(start, 500);
            String[] next = members.ToString().Split(new string[] { "a href='" }, StringSplitOptions.None);
            String[] link = next[1].Split(' ');
            Console.WriteLine(link[0].Replace("'", ""));
        }
        catch(Exception e) { Console.WriteLine("Failed: " + e.ToString()); }
    }

    Console.Read();

Thanks.

One option (not necessarily the most efficient) is to use [regular expressions](http://www.regular-expressions.info/dotnet.html) and retrieve the url by using capturing groups — mgibsonbr, May 28 '12 at 11:02
Sigh... [parsing HTML with regex](http://stackoverflow.com/a/1732454/1583) :( — Oded, May 28 '12 at 11:02
@Oded if the structure is constant, it doesn't matter if the original language is regular or not. If there is greater variance on what you want to match, then yes, I agree completely — mgibsonbr, May 28 '12 at 11:04
(please don't take my last comment as an encouragement for using this technique - in general it's a very bad idea; but I keep my statement that, if your use case is limited and you know what you're doing it can be a simpler and faster way of extracting info from a text without requiring a full parse) — mgibsonbr, May 28 '12 at 11:20
@mgibsonbr - I agree that a constant structure means that a regex is more likely to work, but HTML content can be very irregular, even if the same "template" is used across the board, and a regex may fail even then. — Oded, May 28 '12 at 11:21
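
For reference, the capturing-group approach discussed in these comments might look roughly like the sketch below. It assumes the markup is exactly as in the question's sample (single-quoted attributes, the link nested inside the `<h3 class='ipsType_subtitle'>` block); the `ProfileLinks` class and `Extract` method are hypothetical names for illustration. As the comments warn, it will break as soon as the page's HTML deviates from that template.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ProfileLinks
{
    // Assumption: attributes are single-quoted and the first <a> after the
    // <h3 class='ipsType_subtitle'> opening tag is the profile link.
    // Singleline lets ".*?" cross the line break between <h3> and <a>.
    static readonly Regex LinkPattern = new Regex(
        @"<h3 class='ipsType_subtitle'>.*?<a href='([^']+)'",
        RegexOptions.Singleline);

    // Returns the captured href of every matching block, in page order.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            urls.Add(m.Groups[1].Value); // group 1 is the URL between the quotes
        }
        return urls;
    }
}
```

With the page source in `members`, the links could then be printed with `foreach (var url in ProfileLinks.Extract(members)) Console.WriteLine(url);`.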

score 1 · Accepted Answer · answered May 28 '12 at 11:14

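// Requires the HtmlAgilityPack library and "using System.Linq;".
// "members" is the page source downloaded in the question.
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(members);

// For every <h3 class="ipsType_subtitle">, take the href of its first <a>.
var links = doc.DocumentNode
    .Descendants("h3")
    .Where(h => h.Attributes["class"] != null && h.Attributes["class"].Value == "ipsType_subtitle")
    .Select(h => h.Descendants("a").First().Attributes["href"].Value)
    .ToArray();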

score 0 · Answer 2 · answered May 28 '12 at 11:01

A better way is to use the HTML Agility Pack.

Asif Mushtaq

score 0 · Answer 3 · answered May 28 '12 at 11:02

The most correct way to parse HTML is to use an HTML parser, such as HtmlAgilityPack. You cannot reliably parse an HTML page any other way.

The proof of this is the "balanced parentheses" concept. You cannot parse a string like ((x)) with a classical regular expression, because you need to remember the parse tree (how deeply you are nested), and regular expressions have no memory for arbitrary nesting depth.

Regular expressions are not bad; they are just not suitable for this type of parsing.
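
To illustrate the balanced-parentheses point, here is a minimal sketch (the `Nesting` class and `IsBalanced` method are hypothetical names, not part of any library). Checking nesting requires remembering the current depth, which is exactly the kind of memory a fixed classical pattern lacks; a parser keeps that state for you.

```csharp
static class Nesting
{
    // Returns true when every '(' has a matching ')' in the right order.
    public static bool IsBalanced(string text)
    {
        int depth = 0; // the "memory": how many '(' are currently open
        foreach (char ch in text)
        {
            if (ch == '(')
            {
                depth++;
            }
            else if (ch == ')')
            {
                depth--;
                if (depth < 0) return false; // a ')' with nothing open
            }
        }
        return depth == 0; // everything opened was closed
    }
}
```

HTML tags nest in the same way parentheses do, only with names attached, which is why a real parser is the safer tool.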

Hope this helps.

score 0 · Answer 4 · answered May 28 '12 at 11:24

Below is your code with a few changes; it should work now. That said, this is certainly not the best method for the task.

// Requires: using System; using System.Net; using System.Text.RegularExpressions;
const string marker = "<h3 class='ipsType_subtitle'>";

WebClient c = new WebClient();
String members = c.DownloadString("http://www.powerbot.org/community/members/");
int times = Regex.Matches(members, marker).Count;
Console.WriteLine(times.ToString());

var member = string.Empty; // extracted block

for (int i = 0; i < times; i++) // start at 0 so that no match is skipped
{
    try
    {
        int start = members.IndexOf(marker);
        // Take at most 500 characters so a block near the end of the page does not throw.
        member = members.Substring(start, Math.Min(500, members.Length - start));

        // Remove the processed marker so the next IndexOf finds the following block.
        members = members.Remove(start, marker.Length);

        String[] next = member.Split(new string[] { "a href='" }, StringSplitOptions.None);
        String[] link = next[1].Split(' ');
        Console.WriteLine(link[0].Replace("'", ""));
    }
    catch (Exception e) { Console.WriteLine("Failed: " + e.ToString()); }
}

Console.Read();
