
I am trying to get this URL using JSoup

http://betatruebaonline.com/img/parte/330/CIGUEÑAL.JPG

Even with encoding, I get an exception. I don't understand why the encoding is wrong. The request goes to

http://betatruebaonline.com/img/parte/330/CIGUEN%C3%91AL.JPG

instead of the correct

http://betatruebaonline.com/img/parte/330/CIGUEN%CC%83AL.JPG

How can I fix this? Thanks.

import java.net.URLEncoder;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

private static void GetUrl()
{
    try
    {
        String url = "http://betatruebaonline.com/img/parte/330/";
        String encoded = URLEncoder.encode("CIGUEÑAL.JPG", "UTF-8");
        Response img = Jsoup
                            .connect(url + encoded)
                            .ignoreContentType(true)
                            .execute();

        System.out.println(url);
        System.out.println("PASSED");
    }
    catch(Exception e)
    {
        System.out.println("Error getting url");
        System.out.println(e.getMessage());
    }
}
ppk
  • Well, it's just a file-not-found exception, i.e. an HTTP 404, when executing. Please make sure the requested URL resource exists at this time. – tommybee Apr 11 '18 at 07:51
  • First of all, `%C3%91` IS A COMPLETE `Ñ` character and doesn't require an `N` beforehand, so `N%C3%91` is indeed an `NÑ` sequence, not a single char. – Luis Colorado Apr 11 '18 at 08:07

4 Answers


The encoding is not wrong. The problem is that the character "Ñ" can be represented in two ways, as composite (decomposed) Unicode or as precomposed Unicode; they look the same but are really different:

precomposed unicode: Ñ           -> %C3%91
composite unicode: N and ~       -> N%CC%83

I emphasize that BOTH ARE CORRECT; it depends on which type of Unicode you want:

String normalize = Normalizer.normalize("Ñ", Normalizer.Form.NFD);
System.out.println(URLEncoder.encode("Ñ", "UTF-8")); //%C3%91
System.out.println(URLEncoder.encode(normalize, "UTF-8")); //N%CC%83
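To make the equivalence concrete, here is a small self-contained sketch (standard library only) showing that the two forms compare unequal as strings, that NFC composition turns the decomposed form back into the precomposed one, and that the percent-encodings differ accordingly:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String precomposed = "Ñ";  // single code point U+00D1
        // NFD decomposes it into "N" + combining tilde U+0303
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        // The two strings render identically but compare unequal.
        System.out.println(precomposed.equals(decomposed));   // false
        System.out.println(precomposed.length() + " vs " + decomposed.length()); // 1 vs 2

        // NFC recomposes the decomposed form back into the precomposed one.
        String recomposed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(recomposed.equals(precomposed));   // true

        // The percent-encodings differ accordingly.
        System.out.println(URLEncoder.encode(precomposed, StandardCharsets.UTF_8)); // %C3%91
        System.out.println(URLEncoder.encode(decomposed, StandardCharsets.UTF_8));  // N%CC%83
    }
}
```

(The `URLEncoder.encode(String, Charset)` overload needs Java 10+; on older JDKs use `encode(s, "UTF-8")` and handle the checked exception.)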
yelliver

What happens here?

As stated by @yelliver, the webserver seems to use NFD-encoded Unicode in its path names. So the solution is to use the same encoding as well.

Is the webserver doing it correctly?

1. For those who are curious (like me), this article on Multilingual Web Addresses brings some light to the subject. In the section on IRI paths (the part that is actually handled by the webserver), it states:

Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based punycode), multi-script path names identify resources located on many kinds of platforms, whose file systems do and will continue to use many different encodings. This makes the path much more difficult to handle than the domain name.

2. More on how to encode paths can be found in Section 5.3.2.2 of the IETF Proposed Standard on Internationalized Resource Identifiers (IRIs), RFC 3987. It says:

Equivalence of IRIs MUST rely on the assumption that IRIs are appropriately pre-character-normalized rather than apply character normalization when comparing two IRIs. The exceptions are conversion from a non-digital form, and conversion from a non-UCS-based character encoding to a UCS-based character encoding. In these cases, NFC or a normalizing transcoder using NFC MUST be used for interoperability. To avoid false negatives and problems with transcoding, IRIs SHOULD be created by using NFC. Using NFKC may avoid even more problems; for example, by choosing half-width Latin letters instead of full-width ones, and full-width instead of half-width Katakana.

3. Unicode Consortium states:

NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.

Conclusion

The webserver mentioned in the question does not conform to the recommendations of the IRI standard or the Unicode Consortium, and uses NFD encoding instead of NFC or NFKC. One way to correctly encode a URL string is as follows:

URI uri = new URI(url.getProtocol(), url.getUserInfo(),
        IDN.toASCII(url.getHost()), url.getPort(),
        url.getPath(), url.getQuery(), url.getRef());

Then convert that Uri to ASCII string:

String correctEncodedURL=uri.toASCIIString(); 

The toASCIIString() calls encode() which uses NFC encoded unicode. IDN.toASCII() converts the host name to Punycode.
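Put together, the approach above can be sketched as a minimal runnable program (standard library only; the input URL is the one from the question, used here purely for illustration). The multi-argument `URI` constructor quotes illegal characters per component, and `toASCIIString()` percent-encodes the remaining non-ASCII characters as UTF-8:

```java
import java.net.IDN;
import java.net.URI;
import java.net.URL;

public class UriEncodeSketch {
    public static void main(String[] args) throws Exception {
        // Raw URL with the precomposed Ñ (U+00D1) in the path.
        URL url = new URL("http://betatruebaonline.com/img/parte/330/CIGUEÑAL.JPG");

        // Rebuild component by component; IDN.toASCII converts the host
        // to Punycode (a no-op for an all-ASCII host like this one).
        URI uri = new URI(url.getProtocol(), url.getUserInfo(),
                IDN.toASCII(url.getHost()), url.getPort(),
                url.getPath(), url.getQuery(), url.getRef());

        System.out.println(uri.toASCIIString());
        // → http://betatruebaonline.com/img/parte/330/CIGUE%C3%91AL.JPG
    }
}
```

Note the result here encodes the precomposed form (`%C3%91`); to target a server that expects NFD paths, you would normalize the path with `Normalizer.Form.NFD` first.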

jschnasse
    Thanks for writing an explanation. It can be helpful for others with the same or similar troubles. – ppk Apr 12 '18 at 17:27
  • For further explanation of the code sample provided in this answer take a look [here](https://stackoverflow.com/a/49796882/1485527). – jschnasse Nov 19 '20 at 14:37

Actually, you have to convert the URL to the decomposed form before URL-encoding it.

Here is a solution which works using Guava and java.text.Normalizer:

import com.google.common.escape.Escaper;
import com.google.common.net.UrlEscapers;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

import java.text.Normalizer;

public class JsoupImageDownload {

    public static void main(String[] args) {

        String url = "http://betatruebaonline.com/img/parte/330/CIGUEÑAL.JPG";
        String encodedurl = null;
        try {
            encodedurl = Normalizer.normalize(url, Normalizer.Form.NFD);
            Escaper escaper = UrlEscapers.urlFragmentEscaper();
            encodedurl = escaper.escape(encodedurl);
            Connection.Response img = Jsoup
                    .connect(encodedurl)
                    .ignoreContentType(true)
                    .execute();

            System.out.println(url);
            System.out.println("PASSED");
        } catch (Exception e) {
            System.out.println("Error getting url: " + encodedurl);
            System.out.println(e.getMessage());
        }
    }
}

These are the Maven dependencies:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>

<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>24.1-jre</version>
</dependency>
gil.fernandes

Very simple solution: the encoding the system provides and the one you need are different, so the following solution will work for you.

private static void GetUrl(String url)
{
    try
    {

        String encodedurl = url.replace("Ñ","N%CC%83");
        Response img = Jsoup
                            .connect(encodedurl)
                            .ignoreContentType(true)
                            .execute();

        System.out.println(url);
        System.out.println("PASSED");
    }
    catch(Exception e)
    {
        System.out.println("Error getting url");
        System.out.println(e.getMessage());
    }
}
Dupinder Singh
  • The trouble is that there may be other chars in a list of URLs, and the code would not work, just failing at runtime. That's why I can't use this approach. – ppk Apr 11 '18 at 07:58
  • 1
    That solution is incorrect. It will lead to `%` chars being encoded as `%25` sequences, and you'll have more trouble. – Luis Colorado Apr 11 '18 at 08:30
  • 1
    The normalization answer works fine, but there is one more problem with Normalizer.Form.NFD: how do you know which form to use, NFD or NFC (and 2 more types are available)? If we use NFD, we assume the char is a composite char, but that is not always true. – Dupinder Singh Apr 11 '18 at 09:57
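Regarding the question raised in the comment above, the standard library can at least tell you which form a given string is already in, via `Normalizer.isNormalized`. A small sketch (this only identifies the form of your local string; which form the server expects in its paths may still have to be discovered by trying both):

```java
import java.text.Normalizer;

public class FormCheck {
    public static void main(String[] args) {
        String precomposed = "CIGUEÑAL.JPG"; // Ñ as the single code point U+00D1
        String decomposed = Normalizer.normalize(precomposed, Normalizer.Form.NFD);

        // isNormalized reports whether a string is already in a given form,
        // so you can pick the matching form before percent-encoding.
        System.out.println(Normalizer.isNormalized(precomposed, Normalizer.Form.NFC)); // true
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC));  // false
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD));  // true
    }
}
```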