
I am working on an RSS news parser. The hrefs in the feed contents come in very different forms: HTML-escaped or not, and URL-encoded or not:

URL-encoded:

https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France

Escaped:

http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect

Neither encoded nor escaped:

https://newsquawk.com/daily/article?id=2490-us-market-open-concerns&utm_source=tradingview&utm_medium=research&utm_campaign=partner-post

Additionally, the RSS feeds may contain unencoded, unsafe characters:

https://www.unsafe.com/a<b>c{d}e[f ]\g^

I need to make all the URLs formally "safe". It seems the only way to get a formally safe URL is to completely unescape and decode it first?

Can I somehow normalize all these different URLs? Is there a way to get a completely unescaped and decoded URL in Go?

func(url string) (completelyDecodedUrl string, err error) {
    // ??
}
Vladimir Bershov

1 Answer

The URL-encoded example is good as-is; that is how you transmit data as part of the URL. If you need the decoded version, parse the URL and print its URL.Fragment field.

As to the second, simply use html.UnescapeString().

For example:

s := "https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France"
u, err := url.Parse(s)
if err != nil {
    panic(err)
}

// Fragment holds the decoded text after '#'.
fmt.Println(u.Fragment)

s2 := "http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&amp;mid=2658568&amp;idx=1&amp;sn=b50084652c901&amp;chksm=f0cb0fabcee7d4&amp;scene=21#wechat_redirect"
// UnescapeString turns HTML entities such as &amp; back into plain characters.
fmt.Println(html.UnescapeString(s2))

This will output (try it on the Go Playground):

:~:text=La Russie a engrangé 93,qui épingle particulièrement la France
http://mp.weixin.qq.com/s?__biz=MzI3MjE0NDA1MQ==&mid=2658568&idx=1&sn=b50084652c901&chksm=f0cb0fabcee7d4&scene=21#wechat_redirect

You do not need to decode the link, as the encoded form is the valid one. You must use the encoded form; the receiving server is the one that needs to decode it.
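To illustrate the difference between keeping the encoded URL for transmission and reading decoded values from its parsed components, here is a minimal sketch. The example.com URL, its parameters, and the helper name decodedParts are made up for illustration; only the net/url calls come from the standard library.

```go
package main

import (
	"fmt"
	"net/url"
)

// decodedParts (hypothetical helper) returns the decoded path, the decoded
// value of the query parameter named key, and the decoded fragment of rawURL.
func decodedParts(rawURL, key string) (path, value, fragment string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", "", err
	}
	return u.Path, u.Query().Get(key), u.Fragment, nil
}

func main() {
	// Made-up URL: the value of x contains an encoded '&' (%26).
	raw := "https://example.com/a%20b?q=caf%C3%A9&x=1%262#sec%201"
	path, x, frag, err := decodedParts(raw, "x")
	if err != nil {
		panic(err)
	}
	fmt.Println(path) // /a b
	fmt.Println(x)    // 1&2
	fmt.Println(frag) // sec 1
}
```

Decoding the whole string at once would turn x=1%262 into x=1&2, which a parser would then read as two separate parameters, so a fully decoded URL is not equivalent to the original.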

To detect whether the URL is HTML-escaped, you may check whether it contains the semicolon character ;, since it is reserved in URLs (see RFC 1738) and HTML escape sequences contain it. So decode() may look like this:

func decode(s string) string {
    if strings.IndexByte(s, ';') >= 0 {
        s = html.UnescapeString(s)
    }
    return s
}

If you're worried about malicious or invalid URLs, you may parse and re-encode the URL:

func decode(s string) (string, bool) {
    if strings.IndexByte(s, ';') >= 0 {
        s = html.UnescapeString(s)
    }
    u, err := url.ParseRequestURI(s)
    if err != nil {
        return "", false
    }
    return u.String(), true
}

Testing it:

fmt.Println(decode(`http//foo.bar`))
fmt.Println(decode(`http://foo.bar/doc?query=abc#first`))
fmt.Println(decode(`https://www.unsafe.com/a<b>c{d}e[f ]\g^`))

This will output (try it on the Go Playground):

 false
http://foo.bar/doc?query=abc#first true
https://www.unsafe.com/a%3Cb%3Ec%7Bd%7De%5Bf%20%5D%5Cg%5E true
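One caveat worth checking: the documentation of url.ParseRequestURI states that the input is assumed not to have a #fragment suffix, so a fragment is not split off and may be treated as part of the path or query. The following is a sketch of a variant, not the code above, that uses url.Parse instead and rejects inputs lacking a scheme or host. The name normalize and the scheme/host check are assumptions made for this illustration.

```go
package main

import (
	"fmt"
	"html"
	"net/url"
	"strings"
)

// normalize (hypothetical name) HTML-unescapes s when it contains a
// semicolon, parses it with url.Parse so that a #fragment is split off,
// requires a scheme and a host, and returns the re-encoded URL.
func normalize(s string) (string, bool) {
	if strings.IndexByte(s, ';') >= 0 {
		s = html.UnescapeString(s)
	}
	u, err := url.Parse(s)
	if err != nil || u.Scheme == "" || u.Host == "" {
		// url.Parse accepts relative references, so reject those explicitly.
		return "", false
	}
	return u.String(), true
}

func main() {
	inputs := []string{
		`http//foo.bar`,
		`http://foo.bar/doc?query=abc#first`,
		`https://www.unsafe.com/a<b>c{d}e[f ]\g^`,
		`https://www.lefigaro.fr/flash-eco/la-russie-a-gagne-93-0220613#:~:text=La%20Russie%20a%20engrang%C3%A9%2093,qui%20%C3%A9pingle%20particuli%C3%A8rement%20la%20France`,
	}
	for _, in := range inputs {
		fmt.Println(normalize(in))
	}
}
```

Under these assumptions the first input is rejected, the unsafe characters in the third are percent-encoded, and the already encoded fragment of the last URL is left as it was rather than being encoded a second time.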
icza
  • It is not known in advance whether the link is encoded, escaped, or neither. And I need to completely decode and unescape the full address, not just the fragment – Vladimir Bershov Jun 18 '22 at 10:20
  • "You do not need to decode the link" - no, I need to completely decode URLs because I send the URLs to frontends (web, android, ios) who then encode them with their frameworks – Vladimir Bershov Jun 18 '22 at 18:03
  • @VladimirBershov No, you still don't have to decode URL params. Encoded URL params are part of the URL. If you need to pass the URL to another entity, you have to pass it as-is. That's the only valid form of URLs. You only have to decode the encoded params if you want to use the values of the params. But if the entity you pass the URL to wants to call the URL, no decoding is needed. – icza Jun 18 '22 at 19:57
  • But what if the RSS feed contains some unencoded and potentially dangerous special characters? It seems that I then cannot send the URL to the frontend as-is, and the only way to be sure it has a valid form is to completely decode and re-encode it? – Vladimir Bershov Jun 19 '22 at 08:36
  • @VladimirBershov You may validate the URL by parsing it, but re-encoding is needless: if parsing succeeds, you may use the original URL string. See edited answer. – icza Jun 19 '22 at 10:07
  • Your `decode` func returns `true` for the link `https://www.unsafe.com/a<b>c{d}e[f ]\g^` – Vladimir Bershov Jun 19 '22 at 14:23
  • @VladimirBershov You're right, the URL parser accepts reserved and unallowed characters. Reencoding is needed to obtain a valid URL. Please see edited answer. – icza Jun 19 '22 at 20:02