I want to rewrite every link in a web page so that it points to a reverse proxy domain.

The rules are:

https://test.com/xxx --> https_test_com.proxy.com/xxx
http://sub.test.com/xxx --> http_sub_test_com.proxy.com/xxx

How can I achieve this with a regex in Go?

The response body is a []byte, encoded as UTF-8.
I have tried the following, but it does not replace the dots in the original domain with underscores. The subdomain depth is variable, so the number of dots can vary:

respBytes := []byte(`_.Xc=function(a){var b=window.google&&window.google.logUrl?"":"https://www.google.com";b+="/gen_204?";b+=a.j(2040-b.length);
        <cite class="iUh30 Zu0yb tjvcx">https://cloud.google.com</cite></div><div class="eFM0qc"><a class="fl" href="https://webcache.googleusercontent.com/search?q=cache:80SWJ_cSDhwJ:https://cloud.google.com/+&amp;cd=1&amp;hl=en&amp;ct=clnk&amp;gl=au" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://webcache.googleusercontent.com/search%3Fq%3Dcache:80SWJ_cSDhwJ:https://cloud.google.com/%2B%26cd%3D1%26hl%3Den%26ct%3Dclnk%26gl%3Dau&amp;ved=2ahUKEwia5ovYsv3xAhXS4jgGHad0BJYQIDAAegQIBRAG"><span>Cached</span></a></li><li class="action-menu-item OhScic zsYMMe" role="menuitem"><a class="fl" href="/search?q=related:https://cloud.google.com/+google+cloud&amp;sa=X&amp;ved=2ahUKEwia5ovYsv3xAhXS4jgGHad0BJYQHzAAegQIBRAH">
        `)
proxyURI := "proxy.com"
var re = regexp.MustCompile(`(http[s]*):\/\/([a-zA-Z0-9_\-.:]*)`)
content := re.ReplaceAll(respBytes, []byte("${1}_${2}."+proxyURI))


| Origin | Actual result | Expected result |
| --- | --- | --- |
| https://www.google.com | https_www.google.com.proxy.com | https_www_google_com.proxy.com |
| https://cloud.google.com | https_cloud.google.com.proxy.com | https_cloud_google_com.proxy.com |
| https://webcache.googleusercontent.com | https_webcache.googleusercontent.com.proxy.com | https_webcache_googleusercontent_com.proxy.com |

Sage Ren
  • Post what you've tried and the results you're getting. If you haven't already, check out the stdlib `regexp` package Replace functions: https://pkg.go.dev/regexp#Regexp.ReplaceAll . – Grokify Jul 25 '21 at 02:51
  • See https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – jub0bs Jul 25 '21 at 10:02

1 Answer

Here's how you can do this:

// Requires imports: fmt, regexp.
func replaceAndPrint() {
    src := `
<a href="https://test.com/xxx">link 1</a>
<a href="https://test.com/yyy">link 2</a>
`
    // Capture everything after the scheme, up to the closing quote.
    r := regexp.MustCompile(`"https://(test\.com[^"]*)"`)
    // Re-emit the quotes so the href attribute stays quoted.
    result := r.ReplaceAllString(src, `"http://sub.$1"`)
    fmt.Println(result)
}

Output:

<a href="http://sub.test.com/xxx">link 1</a>
<a href="http://sub.test.com/yyy">link 2</a>

Explanation: regexp.MustCompile's argument defines a capturing group (inside a pair of parentheses). The value of that capturing group is referenced by $1 in the call to r.ReplaceAllString.
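
A small sketch of how the replacement template treats group references may help here. The pattern and the `expand` helper are illustrative assumptions, not part of the question. It shows two things: a reference followed directly by an underscore needs braces, and a template can only insert the captured text verbatim, which is why the dots in the host survive.

```go
package main

import (
	"fmt"
	"regexp"
)

// linkRe is an illustrative pattern: group 1 is the scheme, group 2 the host.
var linkRe = regexp.MustCompile(`(https?)://([a-zA-Z0-9.-]+)`)

// expand is a hypothetical helper that applies a replacement template to src.
func expand(src, template string) string {
	return linkRe.ReplaceAllString(src, template)
}

func main() {
	src := "https://test.com/xxx"

	// "$1_" is read as a reference to a group named "1_", which does not
	// exist, so it expands to the empty string.
	fmt.Println(expand(src, "$1_$2")) // test.com/xxx

	// Braces delimit the group name. The captured host is inserted as-is,
	// so its dots are left untouched.
	fmt.Println(expand(src, "${1}_${2}")) // https_test.com/xxx
}
```

Because the template cannot transform a captured group, changing the dots inside the host needs a function-based replacement instead.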

UPDATE:

Sorry, misread the example.

Here's an updated version:

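// Requires imports: fmt, regexp, strings.
func replaceAndPrint2() {
    src := `
<a href="http://test.com/xxx">link 1</a>
<a href="https://sub1.sub2.test.com/yyy">link 2</a>
`
    // Match "://" or "." together with the label that follows it
    // (everything up to the next "." or "/").
    r := regexp.MustCompile(`(\.|://)([^./]*)`)
    // Both separators become underscores.
    replacer := strings.NewReplacer("://", "_", ".", "_")
    res := r.ReplaceAllStringFunc(src, func(g string) string {
        // Caveat: the proxy suffix is appended only after ".com",
        // so hosts with other TLDs are not handled.
        if g == ".com" {
            return replacer.Replace(g) + ".proxy.com"
        }
        return replacer.Replace(g)
    })
    fmt.Println(res)
}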

Output:

<a href="http_test_com.proxy.com/xxx">link 1</a>
<a href="https_sub1_sub2_test_com.proxy.com/yyy">link 2</a>
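
The version above appends the proxy suffix only when it sees ".com", and it would also rewrite dots outside of URLs. A more general variant could match the whole scheme-plus-host in one go and build the replacement in a function. The following is only a sketch under some assumptions: `rewriteLinks` and `originRe` are hypothetical names, hosts are taken to consist of letters, digits, hyphens and dots, and ports are not handled. It works on []byte, as in the question.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

// originRe captures the scheme (group 1) and the host (group 2) of an
// absolute http/https URL. The host is a dot-separated run of labels, so a
// trailing dot (for example at the end of a sentence) is not included.
var originRe = regexp.MustCompile(`(https?)://([a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*)`)

// rewriteLinks is a hypothetical helper: it turns "scheme://a.b.c" into
// "scheme_a_b_c.<proxyHost>" everywhere in body.
func rewriteLinks(body []byte, proxyHost string) []byte {
	return originRe.ReplaceAllFunc(body, func(m []byte) []byte {
		// Re-run the pattern on the match to get at the individual groups.
		parts := originRe.FindSubmatch(m)
		host := bytes.ReplaceAll(parts[2], []byte("."), []byte("_"))

		out := make([]byte, 0, len(m)+len(proxyHost)+1)
		out = append(out, parts[1]...) // scheme
		out = append(out, '_')
		out = append(out, host...) // host with dots replaced
		out = append(out, '.')
		out = append(out, proxyHost...)
		return out
	})
}

func main() {
	body := []byte(`<a href="https://test.com/xxx">a</a> <a href="http://sub.test.com/xxx">b</a>`)
	fmt.Printf("%s\n", rewriteLinks(body, "proxy.com"))
	// <a href="https_test_com.proxy.com/xxx">a</a> <a href="http_sub_test_com.proxy.com/xxx">b</a>
}
```

Note that this rewrites every http/https origin it finds, including URLs embedded in query strings or scripts, which may or may not be what you want.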
m1kael
  • Sorry, the rule is not to replace the domain with its subdomain. – Sage Ren Jul 25 '21 at 05:04
  • Thanks, `ReplaceAllFunc` and `ReplaceAllStringFunc` work well: match the target, **format it with a function**, and replace it. – Sage Ren Jul 25 '21 at 12:00