I am trying to grab member profile links from a forum and display them in a console app. I want to collect all of those links from the web page and print them out.

Currently I am getting the page source like so:

String source = new WebClient().DownloadString("URL");

What I want to do is iterate through that string and find every block like this:

<h3 class='ipsType_subtitle'>
         <strong><a href='http://www.website.org/community/user/8416-unreal/' title='View Profile'>!Unreal</a></strong>
</h3>

Then, once I have that part, I want to extract the URL like so:

http://www.website.org/community/user/8416-unreal/

This is the code I have tried so far. It works, but it only grabs one of the links:

    WebClient c = new WebClient();
    String members = c.DownloadString("http://www.powerbot.org/community/members/");
    int times = Regex.Matches(members, "<h3 class='ipsType_subtitle'>").Count;
    Console.WriteLine(times.ToString());

    for (int i = 1; i < times; i++)
    {
        try
        {
            int start = members.IndexOf("<h3 class='ipsType_subtitle'>");
            members = members.Substring(start, 500);
            String[] next = members.ToString().Split(new string[] { "a href='" }, StringSplitOptions.None);
            String[] link = next[1].Split(' ');
            Console.WriteLine(link[0].Replace("'", ""));
        }
        catch(Exception e) { Console.WriteLine("Failed: " + e.ToString()); }
    }

    Console.Read();

Thanks.

Duncan Palmer
  • [What have you tried](http://whathaveyoutried.com)? – Oded May 28 '12 at 10:59
  • One option (not necessarily the most efficient) is to use [regular expressions](http://www.regular-expressions.info/dotnet.html) and retrieve the URL by using capturing groups – mgibsonbr May 28 '12 at 11:02
  • Sigh... [parsing HTML with regex](http://stackoverflow.com/a/1732454/1583) :( – Oded May 28 '12 at 11:02
  • @Oded if the structure is constant, it doesn't matter whether the original language is regular or not. If there is greater variance in what you want to match, then yes, I agree completely – mgibsonbr May 28 '12 at 11:04
  • (please don't take my last comment as encouragement to use this technique. In general it's a very bad idea, but I stand by my statement that, if your use case is limited and you know what you're doing, it can be a simpler and faster way of extracting info from a text without requiring a full parse) – mgibsonbr May 28 '12 at 11:20
  • @mgibsonbr - I agree that a constant structure means that a regex is more likely to work, but HTML content can be very irregular, even if the same "template" is used across the board, and a regex may fail even then. – Oded May 28 '12 at 11:21
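
For reference, the capturing-group idea from the comments could look roughly like the sketch below. It assumes the markup is exactly as shown in the question (single-quoted attributes, the anchor directly inside `<strong>` inside the `<h3>`); the class name `ProfileLinks` and the method `Extract` are made up for illustration. As the comments point out, this breaks as soon as the template changes, so treat it as a quick hack, not a robust parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ProfileLinks
{
    // Assumption: attributes are single-quoted and the <a> sits inside
    // <strong> right after the <h3>, exactly as in the question's sample.
    private static readonly Regex LinkPattern = new Regex(
        @"<h3 class='ipsType_subtitle'>\s*<strong>\s*<a href='([^']+)'",
        RegexOptions.IgnoreCase);

    // Returns the href of every profile link found in the page source.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            // Group 1 is the part captured by ([^']+), i.e. the URL.
            urls.Add(m.Groups[1].Value);
        }
        return urls;
    }
}
```

With the page source from the question in `members`, something like `foreach (var url in ProfileLinks.Extract(members)) Console.WriteLine(url);` would print one URL per line.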

4 Answers

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(members);

var links = doc.DocumentNode
    .Descendants("h3")
    .Where(h => h.Attributes["class"] != null && h.Attributes["class"].Value == "ipsType_subtitle")
    .Select(h => h.Descendants("a").First().Attributes["href"].Value)
    .ToArray();
L.B

The better way is to use the HTML Agility Pack.

Asif Mushtaq

The most correct way to parse HTML is to use an HTML parser, such as HtmlAgilityPack. You cannot correctly parse an HTML page any other way.

The proof of this is the "balanced parentheses" concept. You cannot parse a string like ((x)) with a regular expression, because you need to keep track of a parse tree (how deeply you are nested), and a classical regular expression has no memory for that.

Regular expressions are not bad; they are just not suitable for this type of parsing.
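
To make the balanced-parentheses point concrete, here is a minimal sketch (the names `Nesting` and `IsBalanced` are made up for illustration). Checking nesting needs a counter that can grow without bound, which is exactly the kind of memory a plain regular expression does not have:

```csharp
static class Nesting
{
    // Returns true when every '(' has a matching ')' in the right order.
    public static bool IsBalanced(string text)
    {
        int depth = 0; // current nesting level
        foreach (char ch in text)
        {
            if (ch == '(')
            {
                depth++;
            }
            else if (ch == ')' && --depth < 0)
            {
                // A ')' appeared with no '(' left to close.
                return false;
            }
        }
        // Anything still open at the end means the string is unbalanced.
        return depth == 0;
    }
}
```

Nested HTML tags have the same shape of problem, with tag names in place of parentheses.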

Hope this helps.

Tigran

Below is your code with some changes; it should now work. That said, this is certainly not the best method for the task.

WebClient c = new WebClient();
String members = c.DownloadString("http://www.powerbot.org/community/members/");
const string tag = "<h3 class='ipsType_subtitle'>";
int times = Regex.Matches(members, tag).Count;
Console.WriteLine(times.ToString());

var member = string.Empty; // extracted chunk for the current heading

// Start at 0 so that every match is visited, including the last one.
for (int i = 0; i < times; i++)
{
    try
    {
        int start = members.IndexOf(tag);
        if (start < 0) break; // no headings left

        // Never read past the end of the page source.
        int length = Math.Min(500, members.Length - start);
        member = members.Substring(start, length);

        // Remove the processed chunk so the next IndexOf finds the next heading.
        members = members.Remove(start, length);

        String[] next = member.Split(new string[] { "a href='" }, StringSplitOptions.None);
        String[] link = next[1].Split(' ');
        Console.WriteLine(link[0].Replace("'", ""));
    }
    catch(Exception e) { Console.WriteLine("Failed: " + e.ToString()); }
}

Console.Read();
Ion Sapoval