
For some background info, I'm new to Go (3 or 4 days), but I'm starting to get more comfortable with it.

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). For my problem, an example will be the easiest way to explain it:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

I'd like to:

  1. Extract the content of <h1..."text".
  2. Insert (and concatenate) this extracted content into the content of <p..."text".
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

So this is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

With the code starting off like this,

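package main

import (
    "fmt" // used by the snippets further down
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // code_example_above stands for the HTML shown above.
    htmlCode := strings.NewReader(`code_example_above`)
    doc, err := goquery.NewDocumentFromReader(htmlCode)
    if err != nil {
        log.Fatal(err)
    }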

I know that I can read <h1..."text" with:

h1Tag := doc.Find("h1 .text")

I also know that I can add the content of <h1..."text" to the content of <p..."text" with this:

doc.Find("p .text").Before("h1 .text")

^But this command inserts the content from every single case of <h1..."text" before every single case of <p..."text".

Then, I found out how to get a step closer to what I want:

doc.Find("p .text").First().Before("h1 .text")

^This command inserts the content from every single case of <h1..."text" only before the first case of <p..."text" (which is closer to what I want).

I also tried using goquery's Each() function, but I could not get any closer to what I wanted with that method (though I'm sure there's a way to do it with Each(), right?)

My biggest issue is that I can't figure out how to associate each instance of <h1..."text" with the <p..."text" instance that immediately follows it.

If it helps, <h1..."text" is always followed by <p..."text" on the web pages I'm trying to parse.

My brain's out of juice. Do any Go geniuses know how to do this and are willing to explain it? Thanks in advance.

EDIT

I found out something else I can do:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    nex := s.Next().Text()
    fmt.Println(s.Text(), nex, "\n\n")
})

^This prints out what I want--the contents of each instance of <h1..."text" followed by its immediate instance of <p..."text". I had thought that s.Next() would return the next <h1> in the selection being iterated, but it returns the element that immediately follows s in doc. Is that correct?

Or, as mattn pointed out, I could also use doc.Find("h1+p").

I'm still having trouble appending <h1..."text" to <p..."text". I'll post that as a separate question, because this one can be broken down into multiple questions and mattn already answered one of them.

GreenRaccoon23

1 Answer


I don't know exactly what code you are writing with goquery, but maybe what you are looking for is the adjacent sibling selector:

h1+p

This matches each p tag that immediately follows an h1 tag.

mattn
  • Whoa! That was easy. That code's a little long though--is there a way to make it shorter? Lol. Thanks! I just figured out another way to do it too that I'll post. – GreenRaccoon23 Jan 06 '15 at 05:34
  • Ok maybe I didn't figure out another way. I almost did though. I'll update my question explaining it. – GreenRaccoon23 Jan 06 '15 at 05:58
  • Well this is ironic. This question is for some code to put data into an sqlite database, and you're the author of the sqlite3 driver! Awesome examples on the git repo. :) – GreenRaccoon23 Jan 06 '15 at 16:22