
I want to extract all list items (the content of each `<li></li>`) with Go. Should I use a regexp to get the `<li>` items, or is there some other library for this?

My intention is to get a list or array in Go that contains all list items from a specific HTML web page. How should I do that?

rablentain
  • Perhaps the [html package](https://godoc.org/golang.org/x/net/html) can be of use? – Michael Mar 28 '15 at 15:08
  • Do you want to scrape whole websites? – Pravin Mishra Mar 28 '15 at 15:11
  • Don't try [parsing HTML with regexp](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). In general, think twice before using a regexp for just about anything; it has its place, but far too many people jump to it for every single thing they do. – Dave C Mar 28 '15 at 15:14
  • @DaveC That is exactly why I asked this question: to find out whether there is a more appropriate method. I don't understand all the down votes... – rablentain Mar 28 '15 at 15:38

2 Answers


You likely want to use the golang.org/x/net/html package. It's not in the Go standard packages, but instead in the Go Sub-repositories. (The sub-repositories are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core.)

There is an example in that documentation that may be similar to what you want.

If you need to stick with the Go standard packages for some reason, then for "typical HTML" you can use encoding/xml.
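
For instance, a minimal sketch of the encoding/xml route, assuming the markup is well-formed enough for the XML decoder (the `<ul>` snippet below is made up for illustration; for messier real-world HTML you can loosen an xml.Decoder with Strict = false, AutoClose = xml.HTMLAutoClose, and Entity = xml.HTMLEntity):

    // Collect the text of each <li> inside a <ul>.
    type list struct {
        Items []string `xml:"li"`
    }

    const page = `<ul><li>first</li><li>second</li><li>third</li></ul>`

    var l list
    if err := xml.Unmarshal([]byte(page), &l); err != nil {
        return err
    }
    fmt.Println(l.Items) // [first second third]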

Both packages tend to use an io.Reader for input. If you have a string or []byte variable you can wrap it with strings.NewReader or bytes.NewReader to get an io.Reader.
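
For example (the data here is just for illustration):

    const page = "<ul><li>one</li><li>two</li></ul>"

    r1 := strings.NewReader(page)       // io.Reader over a string
    r2 := bytes.NewReader([]byte(page)) // io.Reader over a []byte

    doc, err := html.Parse(r1) // the same call works with r2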

For HTML it's more likely you'll be reading from an http.Response body (make sure to close it when done). Perhaps something like:

    resp, err := http.Get(someURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body)
    if err != nil {
        return err
    }
    // Recursively visit nodes in the parse tree
    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
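            // (For the question's <li> items, match n.Data == "li" instead.)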
            for _, a := range n.Attr {
                if a.Key == "href" {
                    fmt.Println(a.Val)
                    break
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
    f(doc)

Of course, parsing fetched web pages won't work for pages that modify their own contents with JavaScript on the client side.

Dave C
  • Note, I forgot about the existence of the `golang.org/x/net/html` package as the Go project's HTML parser, so a previous version of this answer only mentioned the Go standard `encoding/xml` package (which I've used in the past for trivial HTML decoding when I wanted/needed to stick with standard packages). – Dave C Mar 29 '15 at 18:21

Here's one way I found to solve this.

If you're trying to extract the text inside an `li` element, you first find the `li` start tag and then advance the tokenizer to the very next token, which should be the text (hopefully). You may need some extra logic if the next token is another tag such as an anchor or span.

    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    z := html.NewTokenizer(bufio.NewReader(resp.Body))
    for {
        tt := z.Next()
        switch tt {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            t := z.Token()
            switch t.Data {
            case "li":
                // The next token should be the text inside the <li>.
                z.Next()
                t = z.Token()
                fmt.Println(t.Data)
            }
        }
    }

But really, you should just use github.com/PuerkitoBio/goquery.
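
For illustration, a rough sketch of what that might look like (someURL is a placeholder and error handling is kept minimal):

    resp, err := http.Get(someURL)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    var items []string
    doc.Find("li").Each(func(i int, s *goquery.Selection) {
        // Collect the text content of every <li> on the page.
        items = append(items, strings.TrimSpace(s.Text()))
    })
    fmt.Println(items)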

Sebastian