What's the best way to extract inner substrings from strings in Golang?

input:

"Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

output:

"this is paragraph \n
 this is paragraph 2"

Is there any string package/library for Go that already does something like this?

package main

import (
    "fmt"
    "strings"
)

func main() {
    longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"

    newString := getInnerStrings("<p>", "</p>", longString)

    fmt.Println(newString)
    //output: this is paragraph \n
    //        this is paragraph 2
}

func getInnerStrings(start, end, str string) string {
    //Brain Freeze
    //Regex?
    //Bytes Loop?
}

thanks

n00dl3
user3173591
  • [Here](http://golang.org/pkg/regexp). Read the part about submatches; it should help you. – tenub Jan 08 '14 at 15:53
  • Yeah, I've seen that, but I wasn't sure if that was the right way to go. Bookmarked for future reference though. – user3173591 Jan 08 '14 at 16:15
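The submatch approach mentioned in the comment could look roughly like the sketch below. The helper name `innerByRegexp` is hypothetical, and the sketch assumes the input is treated as plain delimited text rather than parsed as HTML; trimming the surrounding spaces is also an assumption about the desired output.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// innerByRegexp is a hypothetical helper illustrating the submatch approach.
// It treats str as plain text, not as HTML.
func innerByRegexp(start, end, str string) []string {
	// QuoteMeta escapes the delimiters, (?s) lets . match newlines, and
	// .*? is non-greedy so each match stops at the nearest end delimiter.
	re := regexp.MustCompile(`(?s)` + regexp.QuoteMeta(start) + `(.*?)` + regexp.QuoteMeta(end))

	var parts []string
	for _, m := range re.FindAllStringSubmatch(str, -1) {
		// m[0] is the whole match, m[1] is the captured inner text.
		parts = append(parts, strings.TrimSpace(m[1]))
	}
	return parts
}

func main() {
	s := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(strings.Join(innerByRegexp("<p>", "</p>", s), "\n"))
}
```

This only handles flat, non-nested delimiters; nested or malformed markup is outside what this pattern can interpret.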

3 Answers

Don't use regular expressions to try and interpret HTML. Use a fully capable HTML tokenizer and parser.

I recommend you read this article on CodingHorror.

Community
thwd

Here is a function that I have been using a lot.

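// GetInnerSubstring returns the part of str between the first occurrence of
// prefix and the next occurrence of suffix. It returns "" when prefix is not
// found. An empty suffix extends the result to the end of str.
func GetInnerSubstring(str string, prefix string, suffix string) string {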
    var beginIndex, endIndex int
    beginIndex = strings.Index(str, prefix)
    if beginIndex == -1 {
        beginIndex = 0
        endIndex = 0
    } else if len(prefix) == 0 {
        beginIndex = 0
        endIndex = strings.Index(str, suffix)
        if endIndex == -1 || len(suffix) == 0 {
            endIndex = len(str)
        }
    } else {
        beginIndex += len(prefix)
        endIndex = strings.Index(str[beginIndex:], suffix)
        if endIndex == -1 {
            if strings.Index(str, suffix) < beginIndex {
                endIndex = beginIndex
            } else {
                endIndex = len(str)
            }
        } else {
            if len(suffix) == 0 {
                endIndex = len(str)
            } else {
                endIndex += beginIndex
            }
        }
    }

    return str[beginIndex:endIndex]
}

You can try it at the playground, https://play.golang.org/p/Xo0SJu0Vq4.
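The function above returns only the first match, while the question asks for every paragraph. A minimal sketch of a loop that collects all matches with `strings.Index` follows; the name `innerStrings`, the whitespace trimming, and the handling of empty delimiters are assumptions, not part of the answer.

```go
package main

import (
	"fmt"
	"strings"
)

// innerStrings is a hypothetical helper (not from the answer above). It
// collects every substring of str found between start and end, in order.
func innerStrings(start, end, str string) []string {
	var parts []string
	if start == "" || end == "" {
		return parts // assumption: empty delimiters match nothing
	}
	for {
		i := strings.Index(str, start)
		if i == -1 {
			break // no further start delimiter
		}
		str = str[i+len(start):] // skip past the start delimiter
		j := strings.Index(str, end)
		if j == -1 {
			break // unterminated start delimiter: ignore the rest
		}
		parts = append(parts, strings.TrimSpace(str[:j]))
		str = str[j+len(end):] // continue after the end delimiter
	}
	return parts
}

func main() {
	longString := "Hello world <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	fmt.Println(strings.Join(innerStrings("<p>", "</p>", longString), "\n"))
}
```

Returning a slice and joining it at the call site keeps the choice of separator out of the helper.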

spicydog

StrExtract retrieves a string between two delimiters.

StrExtract(sExper, sAdelim, sCdelim, nOccur)

sExper: Specifies the expression to search.

sAdelim: Specifies the string that delimits the beginning of the extracted text.

sCdelim: Specifies the string that delimits the end of the extracted text.

nOccur: Specifies at which occurrence of sAdelim in sExper to start the extraction.

Go Play

package main

import (
    "fmt"
    "strings"
)

func main() {
    s := "a11ba22ba333ba4444ba55555ba666666b"
    fmt.Println("StrExtract1: ", StrExtract(s, "a", "b", 5))
}

func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {

    aExper := strings.Split(sExper, sAdelim)

    if len(aExper) <= nOccur {
        return ""
    }

    sMember := aExper[nOccur]
    aExper = strings.Split(sMember, sCdelim)

    if len(aExper) == 1 {
        return ""
    }

    return aExper[0]
}
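
As a rough illustration of how nOccur behaves, the sketch below applies StrExtract to the input from the question. The function body is copied from the answer; the loop bound of 3 is an arbitrary choice for the demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

// StrExtract is copied unchanged from the answer above.
func StrExtract(sExper, sAdelim, sCdelim string, nOccur int) string {
	aExper := strings.Split(sExper, sAdelim)
	if len(aExper) <= nOccur {
		return ""
	}
	sMember := aExper[nOccur]
	aExper = strings.Split(sMember, sCdelim)
	if len(aExper) == 1 {
		return ""
	}
	return aExper[0]
}

func main() {
	s := "Hello <p> this is paragraph </p> this is junk <p> this is paragraph 2 </p> this is junk 2"
	// nOccur counts from 1: 1 selects the text after the first "<p>".
	// Once nOccur exceeds the number of "<p>" delimiters, the result is "".
	for n := 1; n <= 3; n++ {
		fmt.Printf("%d: %q\n", n, StrExtract(s, "<p>", "</p>", n))
	}
}
```

Note that the surrounding spaces are kept, and that an empty result does not distinguish "no such occurrence" from an empty match.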
Opal
Ali Altun