I would like to extract the attributes of both source elements from this HTML snippet:

<audio controls>
<source src="horse.mp3" type="audio/mpeg">
<source src="horse.ogg" type="audio/ogg">
<embed height="50" width="100" src="horse.mp3">
</audio>

Here is what I do:

First of all, I'm extracting all audio tags (including the one you can see above):

var audio_tags = HtmlDoc.DocumentNode.SelectNodes("//" + HTML.TAG_AUDIO); 

After that, I try to extract the source elements from the HtmlNodeCollection audio_tags using this piece of code:

foreach (HtmlNode link in audio_tags)
        {
            if (link != null)
            {
                string url;
                string type;
                // select all source tags, see here for an example: http://www.w3schools.com/html/html_sounds.asp
                if(link.HasChildNodes)
                {
                    var children = link.ChildNodes;
                    if (children != null)
                    {
                        foreach (HtmlNode child in children)
                        {
                            Console.WriteLine(children[0].GetAttributeValue("type", "err").ToString() + "||" + children[0].OriginalName);
                            Console.WriteLine(children[1].GetAttributeValue("type", "errrr").ToString() + "||" + children[1].OriginalName);
 ...

The WriteLine calls indicate that the first element doesn't exist, because "err" is printed. But it should be the first source element. I'd be glad of some hints.

edit:

The output from those WriteLine calls is:

 err||#text
 audio/mpeg||source

And the number of elements in children is 2.

user1826831
  • What is this _"audio controls"_ element? An element in XML can't have a whitespace. If "controls" is an attribute, it must have a value. – Cédric Bignon Feb 05 '13 at 16:01
  • You're right! audio is a new HTML5 element, and it is explained [here](http://www.w3schools.com/tags/att_audio_controls.asp). To summarize: in XHTML, attribute minimization is forbidden, and the controls attribute must be defined as `<audio controls="controls">`. – user1826831 Feb 05 '13 at 17:18

1 Answer

The first problem is that your <source> tag is not closed. AgilityPack closes it automatically, but in such a way that the second <source> tag and the <embed> tag end up inside the first <source> tag. AgilityPack already knows that <embed> is a self-closing tag, but it does not treat <source> that way here. Fortunately, you can tell it to treat a tag as self-closing:

HtmlNode.ElementsFlags.Add("source", HtmlElementFlag.Empty);
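
A minimal sketch of where that setting has to go, assuming HtmlAgilityPack is referenced in the project. `ElementsFlags` is static, so it should be set before the document is parsed. The class name `AudioChildren` and the method `Describe` are made up for illustration. The indexer is used instead of `Add` on the assumption that some releases of the library may already register `source`, in which case `Add` would throw on the duplicate key.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class AudioChildren
{
    // Hypothetical helper: returns "NodeType||Name" for every direct child
    // of the first <audio> element, so the nesting can be inspected.
    public static List<string> Describe(string html)
    {
        // Static setting; it must be in place before LoadHtml runs.
        HtmlNode.ElementsFlags["source"] = HtmlElementFlag.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        HtmlNode audio = doc.DocumentNode.SelectSingleNode("//audio");
        if (audio == null)
            return result; // no <audio> element in the input

        foreach (HtmlNode child in audio.ChildNodes)
            result.Add(child.NodeType + "||" + child.Name);
        return result;
    }

    static void Main()
    {
        // The snippet from the question, line breaks included.
        string html =
            "<audio controls>\n" +
            "<source src=\"horse.mp3\" type=\"audio/mpeg\">\n" +
            "<source src=\"horse.ogg\" type=\"audio/ogg\">\n" +
            "<embed height=\"50\" width=\"100\" src=\"horse.mp3\">\n" +
            "</audio>";

        foreach (string line in Describe(html))
            Console.WriteLine(line);
    }
}
```

Printing the node type next to the name makes it easy to see which children are elements and which are whitespace text nodes.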

The second problem is text nodes. Every sequence of line breaks and spaces between tags is converted into a text node. I assume you want to get rid of them, so nodes of this kind can simply be skipped.

Finally, you could improve the readability of your code by using LINQ or XPath with AgilityPack. Here is an example:

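// requires: using System; using System.Linq; using HtmlAgilityPack;
// html is assumed to hold the snippet from the question
var doc = new HtmlDocument();
doc.LoadHtml(html);
doc.DocumentNode
    .Descendants("audio")
    .SelectMany(a =>
        // keep element children only; this skips the whitespace text nodes
        a.ChildNodes.Where(n => n.NodeType == HtmlNodeType.Element)
    ).ToList()
    .ForEach(n =>
        Console.WriteLine("{0}||{1}", n.GetAttributeValue("type", "err"), n.OriginalName)
    );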

This will get you something like:

audio/mpeg||source 
audio/ogg||source 
err||embed
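
The XPath route mentioned above could look roughly like this. It is a sketch only, assuming HtmlAgilityPack is referenced; the class name `AudioSources` and the method `Extract` are hypothetical. Selecting `//audio/source` directly avoids the text nodes altogether, because the expression only matches `source` elements.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class AudioSources
{
    // Hypothetical helper: collects (src, type) for each <source>
    // that is a direct child of an <audio> element.
    public static List<(string Src, string Type)> Extract(string html)
    {
        // Treat <source> as self-closing so siblings are not nested.
        HtmlNode.ElementsFlags["source"] = HtmlElementFlag.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<(string Src, string Type)>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//audio/source");
        if (nodes == null)
            return result; // SelectNodes yields null when nothing matches

        foreach (HtmlNode source in nodes)
        {
            result.Add((
                source.GetAttributeValue("src", ""),
                source.GetAttributeValue("type", "")));
        }
        return result;
    }

    static void Main()
    {
        string html =
            "<audio controls>\n" +
            "<source src=\"horse.mp3\" type=\"audio/mpeg\">\n" +
            "<source src=\"horse.ogg\" type=\"audio/ogg\">\n" +
            "<embed height=\"50\" width=\"100\" src=\"horse.mp3\">\n" +
            "</audio>";

        foreach (var s in Extract(html))
            Console.WriteLine("{0}||{1}", s.Src, s.Type);
    }
}
```

The null check matters here: an empty match is reported as `null` rather than as an empty collection, so iterating the result without the check can throw.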
Oleks
  • Thanks, this helped me a lot. It works for now. But I don't get this: "Every line break/spaces sequence is converted into the text node." Could you give me an example? – user1826831 Feb 13 '13 at 14:26
  • @user1826831: in your example the `<source>` tags are separated by newlines and/or spaces. AgilityPack converts these (and other) blocks of text into `HtmlTextNode` objects. So there are `HtmlTextNode`s between your `<audio>` and `<source>` tags. That's why you got the `#text` value in your output (`#text` is the name of a text node). – Oleks Feb 13 '13 at 14:56
  • Thanks a lot. Now I've got it. Should've read the documentation ;) Are there any tutorials on using HtmlAgilityPack? – user1826831 Feb 14 '13 at 09:59
  • @user1826831: I'd recommend [this](http://stackoverflow.com/a/2588910/102112) series of articles – Oleks Feb 14 '13 at 10:11