I am trying to make an add-on for a game called Tibia.

On its website, Tibia.com, you can look up characters and see their deaths.

For example:

http://www.tibia.com/community/?subtopic=characters&name=Kixus

I want to read the death data using Regex in my C# application.

But I cannot work it out; I have spent hours and hours on

http://myregextester.com/index.php

The expression I use is:

<tr bgcolor=(?:"#D4C0A1"|"#F1E0C6") ><td width="25%" valign="top" >(.*?)?#160;CET</td><td>((?:Died|Killed) at Level ([^ ]*)|and) by (?:<[^>]*>)?([^<]*).</td></tr>

But I cannot make it work.

I want to capture the timestamp, the level, and the name of the creature or player.

Thanks in advance.

-Regards

Tom

4 Answers

It's a bad idea to use regular expressions to parse HTML. They're a very poor tool for the job. If you're parsing HTML, use an HTML parser.

For .NET, the usual recommendation is to use the HTML Agility Pack.

Joe White

As Joe White suggested, you will get a much more robust implementation if you use an HTML parser for this task. There is plenty of support for this on Stack Overflow.

If you really have to use regexes

I would recommend breaking your solution down into simpler regexes and applying them top-down, so that each one only has to deal with the text the previous one extracted.

For example:

  1. Use a regex on the whole page to match the character table.

    I would suggest matching the shortest unique string before and after the table rather than the table itself, and capturing the table with a group, since this avoids having to deal with the possibility of nested tables.

  2. Use a regex on the character table to match the table rows.

  3. Use a regex on the first cell to match the date.
  4. Use a regex on the second cell to match links.
  5. Use a regex on the second cell to match the player's level.
  6. Use a regex on the second cell to match the killer's name if it was a creature (in that case there are no links in the cell).

This will be much more maintainable if the site changes its HTML structure significantly.
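The steps above could be sketched roughly as follows. This is only an illustration under assumptions: the "Character Deaths" marker, the two-cell row layout, and the names Death and TopDownDeathParser are hypothetical and not taken from the real page, so the patterns would need adjusting against the actual markup.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical result type for this sketch
public class Death
{
    public string Date;
    public int Level;
    public List<string> Killers = new List<string>();
}

public static class TopDownDeathParser
{
    const RegexOptions Opts = RegexOptions.Singleline | RegexOptions.IgnoreCase;

    // Step 1: capture the text between a unique marker and the end of the table (assumed markup)
    static readonly Regex TableRegex = new Regex(@"Character Deaths(.*?)</table>", Opts);
    // Step 2: one match per two-cell row; group 1 is the date cell, group 2 the details cell
    static readonly Regex RowRegex = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>([^<]*)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>", Opts);
    // Steps 4-6: small patterns applied to the details cell only
    static readonly Regex LinkRegex = new Regex(@"<a[^>]*>(.*?)</a>", Opts);
    static readonly Regex LevelRegex = new Regex(@"\bat level (\d+)", Opts);
    static readonly Regex CreatureRegex = new Regex(@"\bby (.+?)\.?\s*$", Opts);

    public static List<Death> Parse(string html)
    {
        var deaths = new List<Death>();
        Match table = TableRegex.Match(html);
        if (!table.Success)
        {
            return deaths; // no deaths table found
        }

        foreach (Match row in RowRegex.Matches(table.Groups[1].Value))
        {
            var death = new Death();
            // Step 3: decode entities such as &#160; and normalise the non-breaking space
            death.Date = WebUtility.HtmlDecode(row.Groups[1].Value).Replace('\u00A0', ' ').Trim();

            string cell = row.Groups[2].Value;
            Match level = LevelRegex.Match(cell);
            death.Level = level.Success ? int.Parse(level.Groups[1].Value) : 0;

            // Links mean the killers were players
            foreach (Match link in LinkRegex.Matches(cell))
            {
                death.Killers.Add(WebUtility.HtmlDecode(link.Groups[1].Value));
            }
            // No links: treat the text after "by" as a creature name
            if (death.Killers.Count == 0)
            {
                Match creature = CreatureRegex.Match(WebUtility.HtmlDecode(cell));
                if (creature.Success)
                {
                    death.Killers.Add(creature.Groups[1].Value);
                }
            }
            deaths.Add(death);
        }
        return deaths;
    }
}
```

Because each pattern only sees the fragment extracted by the previous step, a change to the page layout should normally affect one small regex rather than a single long expression.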

A complete working implementation using the HTML Agility Pack

You can download the library from the HTML Agility Pack site on CodePlex, or install the HtmlAgilityPack package from NuGet.

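using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// This class is used to represent the extracted details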
public class DeathDetails
{
    public DeathDetails()
    {
        this.KilledBy = new List<string>();
    }

    public string DeathDate { get; set; }
    public List<string> KilledBy { get; set; }
    public int PlayerLevel { get; set; }
}

public class CharacterPageParser
{
    public string CharacterName { get; private set; }

    public CharacterPageParser(string characterName)
    {
        this.CharacterName = characterName;
    }

    public List<DeathDetails> GetDetails()
    {
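        // escape the name so that spaces and special characters survive in the query string
        string url = "http://www.tibia.com/community/?subtopic=characters&name=" + Uri.EscapeDataString(this.CharacterName);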
        string content = GetContent(url);

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(content);

        HtmlNodeCollection tables = document.DocumentNode.SelectNodes("//div[@id='characters']//table");

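        List<DeathDetails> deaths = new List<DeathDetails>();

        // SelectNodes returns null when nothing matches, and the page
        // may not contain a "Character Deaths" table at all
        HtmlNode table = tables == null ? null : GetCharacterDeathsTable(tables);
        if (table == null)
        {
            return deaths;
        }

        // start at 1 to skip the title row
        for (int i = 1; i < table.ChildNodes.Count; i++)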
        {
            DeathDetails details = BuildDeathDetails(table, i);
            deaths.Add(details);
        }
        return deaths;
    }

    private static string GetContent(string url)
    {
        using (System.Net.WebClient c = new System.Net.WebClient())
        {
            string content = c.DownloadString(url);
            return content;
        }
    }

    private static DeathDetails BuildDeathDetails(HtmlNode table, int i)
    {
        DeathDetails details = new DeathDetails();

        HtmlNode tableRow = table.ChildNodes[i];

        //every row should have two cells in it
        if (tableRow.ChildNodes.Count != 2)
        {
            throw new Exception("Html format may have changed");
        }

        HtmlNode deathDateCell = tableRow.ChildNodes[0];
        details.DeathDate = System.Net.WebUtility.HtmlDecode(deathDateCell.InnerText);

        HtmlNode deathDetailsCell = tableRow.ChildNodes[1];
        // get inner text to parse for player level and or creature name
        string deathDetails = System.Net.WebUtility.HtmlDecode(deathDetailsCell.InnerText);

        // get player level using regex
        Match playerLevelMatch = Regex.Match(deathDetails, @" level ([\d]+) ", RegexOptions.IgnoreCase);
        int playerLevel = 0;
        if (int.TryParse(playerLevelMatch.Groups[1].Value, out playerLevel))
        {
            details.PlayerLevel = playerLevel;
        }

        if (deathDetailsCell.ChildNodes.Count > 1)
        {
            // death details contains links which we can parse for character names

            foreach (HtmlNode link in deathDetailsCell.ChildNodes)
            {
                if (link.OriginalName == "a")
                {
                    string characterName = System.Net.WebUtility.HtmlDecode(link.InnerText);
                    details.KilledBy.Add(characterName);
                }
            }
        }
        else
        {
            // player was killed by a creature - capture creature name
            Match creatureMatch = Regex.Match(deathDetails, " by (.*)", RegexOptions.IgnoreCase);
            string creatureName = creatureMatch.Groups[1].Value;
            details.KilledBy.Add(creatureName);
        }
        return details;
    }

    private static HtmlNode GetCharacterDeathsTable(HtmlNodeCollection tables)
    {
        foreach (HtmlNode table in tables)
        {
            // Get first row
            HtmlNode tableRow = table.ChildNodes[0];

            // check to see if contains enough elements
            if (tableRow.ChildNodes.Count == 1)
            {
                HtmlNode tableCell = tableRow.ChildNodes[0];
                string title = tableCell.InnerText;

                // skip this table if it doesn't have the right title
                if (title == "Character Deaths")
                {
                    return table;
                }
            }
        }

        return null;
    }
}

And an example of it in use:

CharacterPageParser kixusParser = new CharacterPageParser("Kixus");

foreach (DeathDetails details in kixusParser.GetDetails())
{
    Console.WriteLine("Player at level {0} was killed on {1} by {2}", details.PlayerLevel, details.DeathDate, string.Join(",", details.KilledBy));
}
sga101

Try this:

http://jsbin.com/atupok/edit#javascript,html

and continue from there. I did most of the work here :)

Edit:

http://jsbin.com/atupok/3/edit

Also, start using this tool

http://regexr.com?2vrmf

instead of the one you have.

Royi Namir
  • Hello friend. Thank you for your fast answer. I can understand it, and it looks nice, thank you. But when I insert it into http://myregextester.com/index.php and modify it a bit, I still get errors. – user1175245 Jan 28 '12 at 14:43
  • Thank you again, friend. I am trying to apply the regexp "([^ \<]+)[\S\s]+?Killed[ ]+at[ ]+level[ ]+([0-9]+)[ ]+by[ ]+[^\&]+\&name=([^ \"]+)" in my C# application: MatchCollection deaths = Regex.Matches(html, @"([^ \<]+)[\S\s]+?Killed[ ]+at[ ]+level[ ]+([0-9]+)[ ]+by[ ]+[^\&]+\&name=([^ \"]+)", RegexOptions.SingleLine); Do I need to use JavaScript instead of this? – user1175245 Jan 28 '12 at 15:07
  • Why JavaScript if you're using C#? – Royi Namir Jan 28 '12 at 15:09

You can also use the Espresso tool to work out a proper regular expression.

To escape every special character in text that should be matched literally, rather than interpreted as part of the regular expression, you can use the Regex.Escape method:

string escapedText = Regex.Escape("<td width=\"25%\" valign=\"top\" >");
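
As a rough sketch of how the escaped text might then be used, the literal markup can be escaped and combined with a capture group. The class and method names here (EscapeExample, ExtractFirstCell) are made up for illustration, and the opening tag is the one quoted in the question, which may not match the live page exactly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeExample
{
    // Returns the contents of the first cell that starts with the literal opening tag,
    // or null when no such cell is found.
    public static string ExtractFirstCell(string html)
    {
        // Literal fragments: Regex.Escape neutralises any metacharacters (and whitespace) in them
        string open = Regex.Escape("<td width=\"25%\" valign=\"top\" >");
        string close = Regex.Escape("</td>");

        // Only the middle part is a real pattern: a lazy capture between the two literals
        Match match = Regex.Match(html, open + "(.*?)" + close, RegexOptions.Singleline);
        return match.Success ? match.Groups[1].Value : null;
    }
}
```

For instance, Regex.Escape("1+1=2?") returns 1\+1=2\?, so the + and ? are matched as literal characters instead of acting as quantifiers.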
Andrii Kalytiiuk
  • Hello again. Thanks for your fast reply. I tried Espresso, and it seems that I get problems after (.*?)?#160;CET. – user1175245 Jan 28 '12 at 14:42
  • There is also a problem in the part you quote in your comment: the capture that is supposed to match the date, (.*?)?, takes all the remaining text in the row up to the date in the next row. When you replace it with ([^(CET)]*), it will match only the date from the first field (without #160;CET at the end). – Andrii Kalytiiuk Jan 29 '12 at 05:34
  • I would also recommend replacing spaces with \s+, as it is difficult to check a large volume of text for the number of spaces between each pair of words, as happens on your web site after the word Killed. – Andrii Kalytiiuk Jan 29 '12 at 09:06
  • Try this one: \\((?:(?!\#160\;CET).)*)#160;CET\\(?:(?:Died|Killed)\s+at Level ([\d]*))\ by\ (?:a\ )?(?:(?:<\s*a[^\<]*>)?([^<]+)(?:<\s*\/\s*a\s*>)?(?:\,\s+|\s+and\s+)?)+(?:(?!<\/td).)*\.\ – Andrii Kalytiiuk Jan 29 '12 at 10:30