How to/Should I retrieve data from particularly formatted HTML without regex

Question

I have a whole pile of HTML which is just a bunch of this:

<li id="entry-c7" data-user="ThisIsSomeonesUsername">
  <img width="28" height="28" class="avatar" src="http://very_long_url.png">
  <span class="time">6:07</span>
  <span class="username">ThisIsSomeonesUsername</span>
  <span class="message">This is my message. It is nice, no?</span>
</li>

This is repeated over and over again, about a hundred thousand times (with different content, of course). It is all taken from an HtmlDocument by retrieving the element which holds it. The document comes from a WebBrowser control in a Windows Form. The retrieval looks like:

HtmlDocument document = webBrowser1.Document;
HtmlElement element = document.GetElementById(chatElementId);

Assume "chatElementId" is just some known ID. What I would like to do is retrieve the content in "time" (6:07 in this example), "username" (ThisIsSomeonesUsername), and "message" (This is my message... etc.). The message portion can contain almost anything, including further html (such as links, images, etc.), but I want to keep all that intact. I was going to use a regular expression to parse the InnerHtml of the element retrieved using the method above, but apparently this will bring about the destruction of the universe. How then should I go about doing this?

Edit: People keep suggesting Html Agility Pack, so is there an easy way to go about doing this in Html Agility Pack without using the full HTML source? I'm not sure if the rest of the html outside of this class is all that great... but should I just pass the whole html anyway?

If you know the answer, then why ask the question? Don't use regex. — John Saunders, Nov 14 '13 at 00:25
Try using HTML Agility Pack http://htmlagilitypack.codeplex.com/ — Vincent Dagpin, Nov 14 '13 at 00:26
Why won't you use a third party library? That's what they are for.. making your life easy. — Simon Whitehead, Nov 14 '13 at 00:27
I'm sure they would make life easier, but the html I'm parsing is months worth of chat data, and I'd rather not use a large tool for a small job if it means introducing extra slowness. I mean, "agility" is in the name of Html Agility Pack, so I'm sure it's not bad. — Carlos Sanchez, Nov 14 '13 at 00:28
It's settled then, use the Html Agility Pack and you'll save yourself a lot of pain! ;) — Andrew Savinykh, Nov 14 '13 at 00:32
But is it really necessary to use a multi-faceted tool when the data is exactly formatted and I'm really only dealing with a tiny bit of data extraction? It's not HTML that someone wrote by hand, it's a bunch of generated html. Is it just a rule to not use regex on HTML? — Carlos Sanchez, Nov 14 '13 at 00:34

score 1 · Answer 1 · edited May 23 '17 at 10:25

Just an FYI: regex can't parse HTML in any usable fashion. See "RegEx match open tags except XHTML self-contained tags", just for those that stumble across this post.

Now, for your requirement, have you tried using XmlDocument or XDocument?

Just try the following. Note that the img tag in your sample is missing the closing />; if that is the case in your real HTML, this won't work, as it's not valid XML.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

//parse the xml (throws an XmlException if the html is not well-formed XML)
var xDoc = XDocument.Parse(html);

//create our list of results (basic tuple here, could be your class)
var attributes = new List<Tuple<string, string, string>>();

//iterate all li elements (assumes the li items are direct children of the root element)
foreach (var element in xDoc.Root.Elements("li"))
{
    //find the time, username and message spans by their class attribute
    //(casting a missing attribute to string yields null, so no separate existence check is needed)
    XElement tElem = element.Elements("span").FirstOrDefault(x => (string)x.Attribute("class") == "time");
    XElement uElem = element.Elements("span").FirstOrDefault(x => (string)x.Attribute("class") == "username");
    XElement mElem = element.Elements("span").FirstOrDefault(x => (string)x.Attribute("class") == "message");

    //set our values based on element results, defaulting to an empty string
    //note: Value returns the text only; nested tags inside the message are not included
    string time = tElem != null ? tElem.Value : "";
    string username = uElem != null ? uElem.Value : "";
    string message = mElem != null ? mElem.Value : "";

    //add to our list
    attributes.Add(Tuple.Create(time, username, message));
}
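
The question asks to keep any HTML inside the message intact, while XElement.Value returns only the concatenated text. Below is a minimal sketch of one way to keep the nested markup, assuming the fragment is well-formed XML. The helper name InnerXml and the sample string are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class InnerMarkupDemo
{
    // Hypothetical helper: serializes the child nodes of an element,
    // so nested tags such as <a> survive (XElement.Value would drop them).
    public static string InnerXml(XElement element)
    {
        return string.Concat(
            element.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }

    static void Main()
    {
        // Sample data invented for this sketch; note the self-closed <img />,
        // which is required for XDocument.Parse to accept the input.
        string html =
            "<ul><li id=\"entry-c7\">" +
            "<img class=\"avatar\" src=\"http://very_long_url.png\" />" +
            "<span class=\"message\">See <a href=\"http://example.com\">this</a>!</span>" +
            "</li></ul>";

        XElement message = XDocument.Parse(html)
            .Descendants("span")
            .First(s => (string)s.Attribute("class") == "message");

        Console.WriteLine(message.Value);     // text only, the <a> tag is gone
        Console.WriteLine(InnerXml(message)); // text plus the nested <a> element
    }
}
```

The serialized output is XML, so it may differ slightly from the original HTML source (for example in how empty elements are written).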

Does the image tag need />? http://www.w3schools.com/tags/tag_img.asp Other than that, this looks perfect. I guess I'll just break down and use an external library then. — Carlos Sanchez, Nov 14 '13 at 00:37
In the example provided, it won't work if the img tag is missing the /> at the end, as it's not valid XML. – Nico, Nov 14 '13 at 00:39
Ah. I'm guessing htmlagilitypack has similar functionality to XmlDocument for html then? — Carlos Sanchez, Nov 14 '13 at 00:43
Your answer is excellent! I love that you took the time to write out all the code. But like you said, the html isn't properly formatted, so either I'd have to use something else, or I'd have to insert the / myself; in which case I would probably use regex. But thank you! I'll use XDocument for future projects if I can. — Carlos Sanchez, Nov 14 '13 at 01:25
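
For the Html Agility Pack route suggested in the comments, here is a rough sketch of parsing only the chat fragment rather than the full page source. It assumes the HtmlAgilityPack NuGet package is referenced; the class name ChatParser and the method Parse are hypothetical names chosen for this example. Unlike XDocument, the library does not require the img tag to be self-closed.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

static class ChatParser
{
    // Hypothetical helper: parses only the fragment (for example the
    // InnerHtml of the chat element), not the whole page.
    public static List<Tuple<string, string, string>> Parse(string fragment)
    {
        var results = new List<Tuple<string, string, string>>();

        // Fully qualified to avoid a clash with System.Windows.Forms.HtmlDocument.
        var doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(fragment);

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        if (items == null)
            return results;

        foreach (HtmlNode li in items)
        {
            HtmlNode time = li.SelectSingleNode(".//span[@class='time']");
            HtmlNode user = li.SelectSingleNode(".//span[@class='username']");
            HtmlNode message = li.SelectSingleNode(".//span[@class='message']");

            results.Add(Tuple.Create(
                time == null ? "" : time.InnerText,
                user == null ? "" : user.InnerText,
                // InnerHtml keeps links, images, etc. inside the message.
                message == null ? "" : message.InnerHtml));
        }

        return results;
    }
}
```

With the objects from the question, a call might look like ChatParser.Parse(element.InnerHtml), where element is the HtmlElement returned by GetElementById.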

score 1 · Accepted Answer · answered Nov 14 '13 at 00:42

Read the link in Nico's answer ... I was about to post the same one (it's hilarious).

Having said that, from your comments it seems like you're intent on regex. So, regex it away.
It shouldn't be hard to do.

Go to http://regexpal.com/, paste your data in the bottom part, play with the regex in the top part until you're happy with the result, and then just loop over your data and extract what you need to your heart's content.

(I'm not sure if I'd do it, but sometimes a quick fix is better than a longer, more "correct" answer.)
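
As a rough illustration of the "loop over your data" step, here is a minimal sketch using .NET's Regex class. The pattern, the class name RegexSketch and the method Extract are written for this example only. It assumes every entry has exactly the generated layout shown in the question and will silently miss entries if that layout changes. The markup returned by InnerHtml may also differ from the page source in tag case or attribute quoting, so the pattern should be checked against real output first.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class RegexSketch
{
    // Illustrative pattern: three spans in a fixed order, then the closing </li>.
    // Singleline lets '.' match newlines inside a message.
    static readonly Regex Entry = new Regex(
        @"<span class=""time"">(?<time>.*?)</span>\s*" +
        @"<span class=""username"">(?<user>.*?)</span>\s*" +
        @"<span class=""message"">(?<message>.*?)</span>\s*</li>",
        RegexOptions.Singleline);

    public static List<Tuple<string, string, string>> Extract(string html)
    {
        var results = new List<Tuple<string, string, string>>();
        foreach (Match m in Entry.Matches(html))
        {
            results.Add(Tuple.Create(
                m.Groups["time"].Value,
                m.Groups["user"].Value,
                m.Groups["message"].Value)); // nested markup is kept as-is
        }
        return results;
    }

    static void Main()
    {
        // The sample entry from the question.
        string html =
            "<li id=\"entry-c7\" data-user=\"ThisIsSomeonesUsername\">\n" +
            "  <img width=\"28\" height=\"28\" class=\"avatar\" src=\"http://very_long_url.png\">\n" +
            "  <span class=\"time\">6:07</span>\n" +
            "  <span class=\"username\">ThisIsSomeonesUsername</span>\n" +
            "  <span class=\"message\">This is my message. It is nice, no?</span>\n" +
            "</li>";

        foreach (var entry in Extract(html))
            Console.WriteLine("{0} | {1} | {2}", entry.Item1, entry.Item2, entry.Item3);
    }
}
```

Because the message group is anchored on the closing </li>, a span nested inside a message does not end the match early, as long as the entry itself still ends with </span></li>.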

Noctis

Could using regex on this very small "subset" of html really cause problems though? I can totally see why doing it in general is a bad thing, but for this small instance, is it really a problem? – Carlos Sanchez Nov 14 '13 at 01:02
As I said, sometimes a small fix is better than the proper way. It depends on your needs, and I can't answer that. If you're sure your data input is regular, go for it. If you have problems or it's not enough, then try for the bigger guns :) – Noctis Nov 14 '13 at 01:10
That's true. That website you posted is excellent, by the way! I like both of the answers, as both are helpful, so I don't know who to accept. I'm also worried that if I accept the regex answer, I'll be slammed. Horribly. – Carlos Sanchez Nov 14 '13 at 01:14
It is a useful website. It doesn't matter which one you choose. They are both valid answers. The point isn't about being slammed, it's about what you feel answered your question best I guess. In any case, as long as you understand what's the problem with regex and html, you should be fine :) – Noctis Nov 14 '13 at 01:18
Ehh, nobody's going to be looking at this anyway, so I'm not going to worry about it somehow giving people the wrong idea (even though you made it pretty clear that regex isn't usually the answer). Thank you for that website! – Carlos Sanchez Nov 14 '13 at 01:22
