C# regex data from website

Question

I am trying to make an addon to a game named Tibia.

On their website, Tibia.com, you can look up characters and see their deaths.

For example:

http://www.tibia.com/community/?subtopic=characters&name=Kixus

Now I want to read the deaths data by using Regex in my C# application.

But I cannot seem to work it out. I have spent hours and hours on

http://myregextester.com/index.php

The expression I use is:

<tr bgcolor=(?:"#D4C0A1"|"#F1E0C6") ><td width="25%" valign="top" >(.*?)?#160;CET</td><td>((?:Died|Killed) at Level ([^ ]*)|and) by (?:<[^>]*>)?([^<]*).</td></tr>

But I cannot make it work.

I want the timestamp, the creature/player level, and the creature/player name.

Thanks in advance.

-Regards

You need to escape your non-alpha numeric characters (such as "<"). — bnieland, Jan 28 '12 at 14:00
sure. To match a
tag, for example, you need to type "\
", thus escaping the non alpha numeric characters. — bnieland, Jan 28 '12 at 17:41

score 2 · Answer 1 · answered Jan 28 '12 at 15:53

It's a bad idea to use regular expressions to parse HTML. They're a very poor tool for the job. If you're parsing HTML, use an HTML parser.

For .NET, the usual recommendation is to use the HTML Agility Pack.

answered Jan 28 '12 at 15:53

Joe White


score 1 · Answer 2 · edited May 23 '17 at 12:27

As suggested by Joe White, you would have a much more robust implementation if you used an HTML parser for this task. There is plenty of support for this on Stack Overflow: see here for example.

If you really have to use regexes

I would recommend breaking your solution down into simpler regexes that are applied one after another, in a top-down parsing approach, to get the results.

For example:

1. Use a regex on the whole page that matches the character table.
   I would suggest matching the shortest unique string before and after the table rather than the table itself, and capturing the table using a group, since this avoids having to deal with the possibility of nested tables.
2. Use a regex on the character table that matches table rows.
3. Use a regex on the first cell to match the date.
4. Use a regex on the second cell to match links.
5. Use a regex on the second cell to match the player's level.
6. Use a regex on the second cell to match the killer's name if it was a creature (there are no links in the cell).

This will be much easier to maintain if the site changes its HTML structure significantly, because only the regex for the affected step has to be updated.
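The top-down steps above can be sketched roughly as follows. This is only an illustration: the `DeathRow` and `TopDownDeathParser` names, the sample markup, and the "Character Deaths" marker used to find the table are assumptions, not the real page layout.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical types and names, used only for illustration.
public class DeathRow
{
    public string Date;
    public int Level;
    public List<string> Killers = new List<string>();
}

public static class TopDownDeathParser
{
    const RegexOptions Opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Made-up sample markup; the real page layout is an assumption.
    public const string SampleHtml = @"<table><tr><td colspan=""2""><b>Character Deaths</b></td></tr>
<tr><td width=""25%"">Jan 28 2012, 10:15:02&#160;CET</td><td>Died at Level 45 by a dragon.</td></tr>
<tr><td width=""25%"">Jan 27 2012, 21:03:44&#160;CET</td><td>Killed at Level 46 by <a href=""#"">Foo</a> and <a href=""#"">Bar</a>.</td></tr>
</table>";

    public static List<DeathRow> Parse(string html)
    {
        var deaths = new List<DeathRow>();

        // 1. Capture the text between a marker before the rows and the end of the table.
        Match table = Regex.Match(html, @"Character Deaths(.*?)</table>", Opts);
        if (!table.Success) return deaths;

        // 2. Split the captured table into rows.
        foreach (Match row in Regex.Matches(table.Groups[1].Value, @"<tr[^>]*>(.*?)</tr>", Opts))
        {
            // 3. Split each row into cells; a death row is expected to have exactly two.
            MatchCollection cells = Regex.Matches(row.Groups[1].Value, @"<td[^>]*>(.*?)</td>", Opts);
            if (cells.Count != 2) continue;

            var death = new DeathRow();
            // First cell: the date (decode &#160; and turn it into a plain space).
            death.Date = WebUtility.HtmlDecode(cells[0].Groups[1].Value).Replace('\u00A0', ' ').Trim();

            string detail = cells[1].Groups[1].Value;
            // Second cell: the player's level.
            Match level = Regex.Match(detail, @"\blevel\s+(\d+)", Opts);
            if (level.Success) death.Level = int.Parse(level.Groups[1].Value);

            // Second cell: killer names taken from links, if there are any ...
            foreach (Match link in Regex.Matches(detail, @"<a[^>]*>(.*?)</a>", Opts))
                death.Killers.Add(WebUtility.HtmlDecode(link.Groups[1].Value));

            // ... otherwise the creature name after "by", minus the trailing period.
            if (death.Killers.Count == 0)
            {
                Match creature = Regex.Match(detail, @"\bby\s+(.*?)\.?\s*$", Opts);
                if (creature.Success) death.Killers.Add(WebUtility.HtmlDecode(creature.Groups[1].Value));
            }
            deaths.Add(death);
        }
        return deaths;
    }
}
```

Calling `TopDownDeathParser.Parse(TopDownDeathParser.SampleHtml)` returns one `DeathRow` per two-cell row. Each step has its own small regex, so a layout change should only affect the step that touches the changed markup.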

A complete working implementation using the Html Agility Pack

You can download the library from the Html Agility Pack site on CodePlex.

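using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// This class is used to represent the extracted details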
public class DeathDetails
{
    public DeathDetails()
    {
        this.KilledBy = new List<string>();
    }

    public string DeathDate { get; set; }
    public List<String> KilledBy { get; set; }
    public int PlayerLevel { get; set; }
}

public class CharacterPageParser
{
    public string CharacterName { get; private set; }

    public CharacterPageParser(string characterName)
    {
        this.CharacterName = characterName;
    }

    public List<DeathDetails> GetDetails()
    {
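        // Escape the name so that characters such as spaces survive in the query string
        string url = "http://www.tibia.com/community/?subtopic=characters&name=" + System.Uri.EscapeDataString(this.CharacterName);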
        string content = GetContent(url);

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(content);

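        HtmlNodeCollection tables = document.DocumentNode.SelectNodes("//div[@id='characters']//table");
        List<DeathDetails> deaths = new List<DeathDetails>();

        // SelectNodes returns null when nothing matches, and a character may have no deaths table
        HtmlNode table = tables == null ? null : GetCharacterDeathsTable(tables);
        if (table == null)
        {
            return deaths;
        }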

        for (int i = 1; i < table.ChildNodes.Count; i++)
        {
            DeathDetails details = BuildDeathDetails(table, i);
            deaths.Add(details);
        }
        return deaths;
    }

    private static string GetContent(string url)
    {
        using (System.Net.WebClient c = new System.Net.WebClient())
        {
            string content = c.DownloadString(url);
            return content;
        }
    }

    private static DeathDetails BuildDeathDetails(HtmlNode table, int i)
    {
        DeathDetails details = new DeathDetails();

        HtmlNode tableRow = table.ChildNodes[i];

        //every row should have two cells in it
        if (tableRow.ChildNodes.Count != 2)
        {
            throw new Exception("Html format may have changed");
        }

        HtmlNode deathDateCell = tableRow.ChildNodes[0];
        details.DeathDate = System.Net.WebUtility.HtmlDecode(deathDateCell.InnerText);

        HtmlNode deathDetailsCell = tableRow.ChildNodes[1];
        // get inner text to parse for player level and or creature name
        string deathDetails = System.Net.WebUtility.HtmlDecode(deathDetailsCell.InnerText);

        // get player level using regex
        Match playerLevelMatch = Regex.Match(deathDetails, @" level ([\d]+) ", RegexOptions.IgnoreCase);
        int playerLevel = 0;
        if (int.TryParse(playerLevelMatch.Groups[1].Value, out playerLevel))
        {
            details.PlayerLevel = playerLevel;
        }

        if (deathDetailsCell.ChildNodes.Count > 1)
        {
            // death details contains links which we can parse for character names

            foreach (HtmlNode link in deathDetailsCell.ChildNodes)
            {
                if (link.OriginalName == "a")
                {
                    string characterName = System.Net.WebUtility.HtmlDecode(link.InnerText);
                    details.KilledBy.Add(characterName);
                }
            }
        }
        else
        {
            // player was killed by a creature - capture creature name
            Match creatureMatch = Regex.Match(deathDetails, " by (.*)", RegexOptions.IgnoreCase);
            // drop the period that ends the sentence in the cell text
            string creatureName = creatureMatch.Groups[1].Value.TrimEnd('.', ' ');
            details.KilledBy.Add(creatureName);
        }
        return details;
    }

    private static HtmlNode GetCharacterDeathsTable(HtmlNodeCollection tables)
    {
        foreach (HtmlNode table in tables)
        {
            // Get first row
            HtmlNode tableRow = table.ChildNodes[0];

            // check to see if contains enough elements
            if (tableRow.ChildNodes.Count == 1)
            {
                HtmlNode tableCell = tableRow.ChildNodes[0];
                string title = tableCell.InnerText;

                // skip this table if it doesn't have the right title
                if (title == "Character Deaths")
                {
                    return table;
                }
            }
        }

        // no table with the expected title was found
        return null;
    }
}

And an example of it in use:

CharacterPageParser kixusParser = new CharacterPageParser("Kixus");

foreach (DeathDetails details in kixusParser.GetDetails())
{
    Console.WriteLine("Player at level {0} was killed on {1} by {2}", details.PlayerLevel, details.DeathDate, string.Join(",", details.KilledBy));
}

Thank you a lot, sir. I am in debt to you. That was what I was looking for. – user1175245, Jan 29 '12 at 10:37

Royi Namir · Answer 3 · 2012-01-28T14:49:31.240

Try this:

http://jsbin.com/atupok/edit#javascript,html

and continue from there. I did most of the work here :)

edit

http://jsbin.com/atupok/3/edit

And start using this tool

http://regexr.com?2vrmf

instead of the one you are using now.

edited Jan 28 '12 at 14:49

answered Jan 28 '12 at 14:17

Royi Namir


Hello friend. Thank you for your fast answer. I can understand it, and it looks nice, thank you. But when I insert it into http://myregextester.com/index.php and modify it a bit, I still get errors. – user1175245 Jan 28 '12 at 14:43
Thank you again friend.. I am trying to apply the regexp "([^ \<]+)[\S\s]+?Killed[ ]+at[ ]+level[ ]+([0-9]+)[ ]+by[ ]+[^\&]+\&name=([^ \"]+)" into my C# application MatchCollection deaths = Regex.Matches(html, @"([^ \<]+)[\S\s]+?Killed[ ]+at[ ]+level[ ]+([0-9]+)[ ]+by[ ]+[^\&]+\&name=([^ \"]+)", RegexOptions.SingleLine); Do I need to use javascript instead of this? – user1175245 Jan 28 '12 at 15:07
Why JavaScript if you're using C#? – Royi Namir Jan 28 '12 at 15:09
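For reference, here is a minimal sketch of how the pattern from the comment above could be written so that it compiles in C#. Two details differ from the snippet in the comment: inside a verbatim string a double quote is written as `""`, and the option is spelled `RegexOptions.Singleline` (it is not needed here because `[\S\s]` already matches newlines). `RegexOptions.IgnoreCase` is added on the assumption that the page writes "Level" with a capital L, as in the question. The class name, method name, and sample markup are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; the pattern is the one from the comment.
public static class KilledByRegexDemo
{
    // The only change to the pattern: the quote is doubled for the verbatim string.
    public const string Pattern =
        @"([^ \<]+)[\S\s]+?Killed[ ]+at[ ]+level[ ]+([0-9]+)[ ]+by[ ]+[^\&]+\&name=([^ \""]+)";

    // Returns { level, killer name } for the first match, or null if nothing matches.
    // Group 1 is not read here because its content depends on where the match starts.
    public static string[] FindLevelAndKiller(string html)
    {
        Match m = Regex.Match(html, Pattern, RegexOptions.IgnoreCase);
        return m.Success ? new[] { m.Groups[2].Value, m.Groups[3].Value } : null;
    }
}
```

No JavaScript is involved: the same pattern runs in .NET once the string literal is quoted correctly.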

score 0 · Answer 4 · answered Jan 28 '12 at 14:23

You can also use the Espresso tool to work out a proper regular expression.

To escape all special characters in text that should be matched literally, rather than interpreted as regular expression syntax, you can use the Regex.Escape method:

string escapedText = Regex.Escape("<td width=\"25%\" valign=\"top\" >");
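As a small sketch of how the escaped text might then be used, the helper below appends a capturing group to the escaped opening tag. The `EscapeDemo` class and `BuildCellPattern` method are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class EscapeDemo
{
    // Builds a pattern that matches openingTag literally and captures the cell body after it.
    public static string BuildCellPattern(string openingTag)
    {
        return Regex.Escape(openingTag) + "(.*?)</td>";
    }
}
```

With the pattern built from `"<td width=\"25%\" valign=\"top\" >"`, `Regex.Match(html, pattern).Groups[1].Value` gives the text of the first cell that starts with exactly that tag. The effect of escaping is easiest to see with a metacharacter such as `+`: the unescaped pattern `a+b` does not match the text `a+b`, while `Regex.Escape("a+b")` does.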

answered Jan 28 '12 at 14:23

Andrii Kalytiiuk


Hello again. Thanks for your fast reply. I tried Espresso, and it seems that I get problems after (.*?)?#160;CET. – user1175245 Jan 28 '12 at 14:42
There is also a problem in the part quoted in your comment: the capture that is supposed to match the date, (.*?)?, takes all the remaining text in the row up to the date in the next row. When you replace it with ([^(CET)]*), it matches only the date from the first field (without #160;CET at the end). – Andrii Kalytiiuk Jan 29 '12 at 05:34
I would also recommend replacing spaces with \s+, because in a large volume of text it is difficult to check how many spaces there are between each pair of words, for example after the word Killed on your web site. – Andrii Kalytiiuk Jan 29 '12 at 09:06
Try this one: \\((?:(?!\#160\;CET).)*)#160;CET\\(?:(?:Died|Killed)\s+at Level ([\d]*))\ by\ (?:a\ )?(?:(?:<\s*a[^\<]*>)?([^<]+)(?:<\s*\/\s*a\s*>)?(?:\,\s+|\s+and\s+)?)+(?:(?!<\/td).)*\.\ – Andrii Kalytiiuk Jan 29 '12 at 10:30
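The two points made in the comments above (a lazy date capture that runs on into the next row, and `\s+` instead of literal spaces) can be reproduced with a small sketch. The sample rows are made up, and the first one has two spaces after "Killed". The strict pattern uses `[^<]*` for the date, which is a substitute chosen for this example and not the `([^(CET)]*)` class from the comment.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class with made-up sample data.
public static class DateCaptureDemo
{
    // Single-line sample; the first row has two spaces after "Killed".
    public const string Sample =
        "<tr><td>Jan 27 2012, 21:03:44&#160;CET</td><td>Killed  at Level 46 by a rat.</td></tr>" +
        "<tr><td>Jan 28 2012, 10:15:02&#160;CET</td><td>Died at Level 45 by a dragon.</td></tr>";

    // Lazy dot plus literal single spaces: the date group can run on into the next row.
    public const string Loose = @"<td>(.*?)&#160;CET</td><td>(?:Died|Killed) at Level (\d+) by";

    // [^<]* cannot leave the cell, and \s+ accepts any run of whitespace.
    public const string Strict = @"<td>([^<]*)&#160;CET</td><td>(?:Died|Killed)\s+at\s+Level\s+(\d+)\s+by";

    public static MatchCollection Find(string pattern)
    {
        return Regex.Matches(Sample, pattern);
    }
}
```

`Find(Loose)` yields a single match whose date group swallows the whole first row, because the double space makes the first row fail and the lazy `.*?` keeps expanding until the second row fits. `Find(Strict)` yields one match per row with a clean date in group 1.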
