
I am using Go's regexp package, and I want to pass an extra argument to the function given to ReplaceAllStringFunc, not only the matched string.

For example, I want to update this text:

"<img src=\"/m/1.jpg\" />  <img src=\"/m/2.jpg\" />  <img src=\"/m/3.jpg\" />"

to this (changing "m" to "a" or anything else):

"<img src=\"/a/1.jpg\" />  <img src=\"/a/2.jpg\" />  <img src=\"/a/3.jpg\" />"

I would like to have something like:

func UpdateText(text string) string {
    re, _ := regexp.Compile(`<img.*?src=\"(.*?)\"`)
    text = re.ReplaceAllStringFunc(text, updateImgSrc) 
    return text
}

// update "/m/1.jpg" to "/a/1.jpg" 
func updateImgSrc(imgSrcText, prefix string) string {
    // replace "m" by prefix
    return "<img src=\"" + newImgSrc + "\""
}

I checked the documentation, and the function passed to ReplaceAllStringFunc receives only the matched string. What would be the best way to achieve my goal?

More generally, I would like to find all occurrences of a pattern and replace each one with a new string built from the matched text plus an extra parameter. Does anyone have an idea?

seaguest
  • No, you don't want to process HTML with regexps. – Volker Jun 20 '16 at 10:12
  • @Volker, the text is not an entire HTML document; it is the content of a news article. What would be the best solution in your opinion? I don't think strings.Replace can easily match a pattern. – seaguest Jun 20 '16 at 11:16
  • Use a proper HTML parser. [`golang.org/x/net/html`](https://godoc.org/golang.org/x/net/html) is one option, and you might find [`github.com/PuerkitoBio/goquery`](https://godoc.org/github.com/PuerkitoBio/goquery) useful. Do [this search](https://godoc.org/?q=html) to get an overview of what's there. – kostix Jun 20 '16 at 11:29
  • Parsing it as html5 works in a lot of cases, maybe just add a doctype and a manually. Or parse as xml. – Volker Jun 20 '16 at 11:29
  • Of course it supports an argument. Your question is very unclear. – hobbs Jun 20 '16 at 16:00

2 Answers


I agree with the comments: you probably don't want to parse HTML with regular expressions (bad things will happen).

However, let's pretend it's not HTML and that you only want to replace submatches. You could do this:

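import (
    "regexp"
    "strings"
)

func UpdateText(input string) (string, error) {
    // Group 1 captures the path before the extension, group 2 the extension.
    re, err := regexp.Compile(`img.*?src=\"(.*?)\.(.*?)\"`)
    if err != nil {
        return "", err
    }
    indexes := re.FindAllStringSubmatchIndex(input, -1)

    // Copy the text between matches unchanged and track the last offset,
    // so a replacement of a different length cannot shift later indexes.
    var output strings.Builder
    last := 0
    for _, match := range indexes {
        imgStart, imgEnd := match[2], match[3] // bounds of group 1
        output.WriteString(input[last:imgStart])
        output.WriteString(strings.Replace(input[imgStart:imgEnd], "m", "a", -1))
        last = imgEnd
    }
    output.WriteString(input[last:])
    return output.String(), nil
}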


(Note that I've slightly changed your regular expression so that it matches the file extension in a separate group.)
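The question also asked how to give the replacement callback an extra parameter. One possible approach, sketched here, is to pass ReplaceAllStringFunc a closure that captures the parameter. The regular expression and its three groups are illustrative assumptions (they expect every src to start with a single leading directory such as /m/), not something taken from the code above.

```go
package main

import (
	"fmt"
	"regexp"
)

// imgSrcRe is an assumed pattern: group 1 is the tag text up to the opening
// quote of src, group 2 the first path segment, group 3 the rest of the path
// including the closing quote.
var imgSrcRe = regexp.MustCompile(`(<img[^>]*?src=")/([^/"]+)(/[^"]*")`)

// UpdateText replaces the first path segment of every img src with prefix.
// The closure captures prefix, which is how the callback gets its extra argument.
func UpdateText(text, prefix string) string {
	return imgSrcRe.ReplaceAllStringFunc(text, func(match string) string {
		// The callback receives only the matched text, so the submatches
		// are recovered by running the same regexp on that text again.
		parts := imgSrcRe.FindStringSubmatch(match)
		return parts[1] + "/" + prefix + parts[3]
	})
}

func main() {
	in := `<img src="/m/1.jpg" />  <img src="/m/2.jpg" />`
	fmt.Println(UpdateText(in, "a"))
	// prints: <img src="/a/1.jpg" />  <img src="/a/2.jpg" />
}
```

The same caveat applies as above: this is only reasonable if the input is not treated as general HTML.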

Dean Elbaz
  • > bad things will happen: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Kaedys Jun 20 '16 at 20:51

Thanks to kostix's advice, here is my solution using an HTML parser (goquery):

func UpdateAllResourcePath(text, prefix string) (string, error) {
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(text))
    if err != nil {
        return "", err
    }

    sel := doc.Find("img")
    length := len(sel.Nodes)
    for index := 0; index < length; index++ {
        imgSrc, ok := sel.Eq(index).Attr("src")
        if !ok {
            continue
        }

        // UpdateResourcePath (not shown here) builds the new src from the old one and prefix.
        newImgSrc, err := UpdateResourcePath(imgSrc, prefix)
        if err != nil {
            return "", err
        }

        sel.Eq(index).SetAttr("src", newImgSrc)
    }

    // The parser wraps the fragment in <html> and <body>, so return only the body's inner HTML.
    newtext, err := doc.Find("body").Html()
    if err != nil {
        return "", err
    }

    return newtext, nil
}
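UpdateResourcePath is not shown in this answer. The following is a minimal sketch of what such a helper might look like, assuming each src is an absolute path whose first segment should be swapped for prefix. The body is hypothetical; only the name and the call signature come from the code above.

```go
package main

import (
	"fmt"
	"strings"
)

// UpdateResourcePath is a hypothetical implementation of the helper called
// above: it swaps the first segment of an absolute path for prefix.
func UpdateResourcePath(src, prefix string) (string, error) {
	if !strings.HasPrefix(src, "/") {
		return "", fmt.Errorf("expected an absolute path, got %q", src)
	}
	// For "/m/1.jpg", SplitN yields ["", "m", "1.jpg"].
	parts := strings.SplitN(src, "/", 3)
	if len(parts) < 3 {
		return "", fmt.Errorf("no leading directory in %q", src)
	}
	return "/" + prefix + "/" + parts[2], nil
}

func main() {
	out, err := UpdateResourcePath("/m/1.jpg", "a")
	fmt.Println(out, err) // prints: /a/1.jpg <nil>
}
```

Returning an error for paths that do not fit the assumed shape keeps the caller's existing error handling meaningful.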
seaguest