I am trying to parse an HTML document with Go's encoding/xml parser. I have managed to extract all the <li> elements, but if an element contains a link (<a>), the content of the link is ignored. I would like to ignore the nested <a> tag itself and display its content as plain text, but I don't know how.

Here is my code:

d := xml.NewDecoder(resp.Body)
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity

type list_item struct {
    Data string `xml:",chardata"`
}

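for {
    t, err := d.Token()
    if err != nil {
        // io.EOF marks the end of the document; stop on any other error too.
        break
    }

    switch se := t.(type) {
    case xml.StartElement:
        if se.Name.Local == "li" {
            var q list_item
            // DecodeElement fills q.Data with the direct character data of <li> only.
            if err := d.DecodeElement(&q, &se); err != nil {
                c.Infof("decode <li>: %v", err)
                continue
            }

            c.Infof("%+v\n", q)
        }
    }
}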

Is there any way to just ignore nested elements and display their content?

rablentain

2 Answers

Consider using a specialized package for parsing HTML. In general, HTML is not XML (XHTML 1.0 is, but documents formatted using it are not very common, and that standard has been deprecated).

An even better approach, in my opinion, given your apparent use case, would be to use XPath to extract the necessary information with a query.

As for the question as stated, I think there's no built-in way to do what you want: xml.Decoder implements the Skip() method, but it only lets you skip over unneeded content, and nothing on the decoder returns an element's text with the nested tags stripped (the `,innerxml` struct tag captures the raw markup, tags included). You could roll this yourself using xml.Decoder's RawToken(): render whatever it returns immediately until it returns the end element you're looking for (you'll have to implement support for handling nested elements).

kostix
  • FYI, it may have been [my failure](http://stackoverflow.com/a/29319185/55504) to remember the `golang.org/x/net/html` Go project package that led to the OP using `encoding/xml`. Oops. – Dave C Mar 29 '15 at 18:23

I found a library that offers jQuery-style access to HTML documents: http://godoc.org/github.com/PuerkitoBio/goquery

I used that and it solved the problem.

rablentain