Problem: we sometimes receive links or phrases whose encoding is invalid for us.

Examples and my first solution below

Description: I have to fix wrongly encoded strings in one part of the application. Sometimes it is a word or phrase, but sometimes it is a URL. When it's a URL, I would like to change only the wrongly encoded characters. If I decode with ISO-8859-1 and encode to UTF-8, the special URL characters (/ : ? = &) are encoded as well. I coded a solution which works just fine for my cases, but the hashes you will see below smell bad to me.

Have you had a similar problem, or do you know a library that can decode a phrase while leaving certain characters alone? Something like this:

decode(String value, char[] ignored)

I also thought about breaking the URL into pieces and fixing only the path and query, but parsing them would be even more of a mess.

TLDR: Decode an ISO-8858-1 encoded URL and encode it to UTF-8. Don't touch URL-specific characters (/ ? = : &)

Input/Output examples:

// wrong input
"http://some.url/xxx/a/%e4t%fcr%E4/b/%e4t%fcr%E4"
"t%E9l%E9phone"

// good output
"http://some.url/xxx/a/%C3%A4t%C3%BCr%C3%A4/b/%C3%A4t%C3%BCr%C3%A4"
"t%C3%A9l%C3%A9phone"

// very wrong output
"http%3A%2F%2Fsome.url%2Fxxx%2Fa%2F%C3%A4t%C3%BCr%C3%A4%2Fb%2F%C3%A4t%C3%BCr%C3%A4"
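
A minimal sketch of one way to map the wrong input above to the good output without touching the URL delimiters at all. It is not the code from this question: the class name Latin1EscapeFixer and its method are hypothetical. It assumes the input is known to carry ISO-8859-1 escapes, and it rewrites only %XX escapes with a byte value of 0x80 or higher. Running it on a string that is already UTF-8 escaped would double-encode it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class Latin1EscapeFixer {

    // Rewrites every %XX escape whose byte value is >= 0x80 (a non-ASCII
    // ISO-8859-1 character) as the UTF-8 percent-escapes of that character.
    // All other text, including / : ? = & and ASCII escapes such as %20,
    // is copied through unchanged.
    static String fix(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        int i = 0;
        while (i < value.length()) {
            char c = value.charAt(i);
            if (c == '%' && i + 2 < value.length()) {
                int hi = Character.digit(value.charAt(i + 1), 16);
                int lo = Character.digit(value.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    int b = (hi << 4) | lo;
                    if (b >= 0x80) {
                        // In ISO-8859-1 the byte value equals the Unicode code point.
                        byte[] utf8 = String.valueOf((char) b).getBytes(StandardCharsets.UTF_8);
                        for (byte u : utf8) {
                            out.append(String.format("%%%02X", u & 0xFF));
                        }
                        i += 3;
                        continue;
                    }
                }
            }
            // Not a high-byte escape: keep the character as it is.
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Because nothing is ever decoded into plain text, there is no step where the delimiters could be re-encoded, so no placeholder substitution is needed.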

My first solution:

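import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;

import java.util.UUID;

class EncodingFixer {
    // Random placeholders that stand in for the URL delimiters during re-encoding.
    // A UUID contains only hex digits and '-', which URLEncoder leaves untouched.
    private static final String SLASH_HASH = UUID.randomUUID().toString();
    private static final String QUESTION_HASH = UUID.randomUUID().toString();
    private static final String EQUALS_HASH = UUID.randomUUID().toString();
    private static final String AND_HASH = UUID.randomUUID().toString();
    private static final String COLON_HASH = UUID.randomUUID().toString();

    EncodingFixer() {
    }

    String fix(String value) {
        // Null, empty and whitespace-only input is returned unchanged (String.isBlank needs Java 11+).
        if (value == null || value.isBlank()) {
            return value;
        }
        return tryFix(value);
    }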

    private String tryFix(String str) {
        try {
            String replaced = replaceWithHashes(str);
            String fixed = java.net.URLEncoder.encode(java.net.URLDecoder.decode(replaced, ISO_8859_1), UTF_8); 
            return replaceBack(fixed);
        } catch (Exception e) {
            return str;
        }
    }

    private String replaceWithHashes(String str) {
        return str
            .replaceAll("/", SLASH_HASH)
            .replaceAll("\\?", QUESTION_HASH)
            .replaceAll("=", EQUALS_HASH)
            .replaceAll("&", AND_HASH)
            .replaceAll(":", COLON_HASH);
    }

    private String replaceBack(String fixed) {
        return fixed
            .replaceAll(SLASH_HASH, "/")
            .replaceAll(QUESTION_HASH, "?")
            .replaceAll(EQUALS_HASH, "=")
            .replaceAll(AND_HASH, "&")
            .replaceAll(COLON_HASH, ":");
    }
}

Or should it be more like this?

  1. Check if the input is a URL
  2. Create a URL object
  3. Get the path
  4. Split it by /
  5. Fix every part
  6. Put it back together
  7. Do the same for the query, which is a little more complicated

I also thought about this, but it seems even messier than those replaceAlls above :/
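
The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the class UrlPartFixer and its methods are hypothetical names, the Charset overloads of URLDecoder and URLEncoder need Java 10+, a literal + in the input is treated as a space (and comes back as %20), the fragment is copied as-is, and a malformed escape such as %zz makes URLDecoder throw IllegalArgumentException.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class UrlPartFixer {

    static String fix(String value) {
        // Steps 1-2: anything that does not parse as scheme://authority/...
        // is treated as a plain phrase.
        URI uri;
        try {
            uri = new URI(value);
        } catch (URISyntaxException e) {
            return fixPart(value);
        }
        if (uri.getScheme() == null || uri.getRawAuthority() == null) {
            return fixPart(value);
        }
        StringBuilder out = new StringBuilder()
                .append(uri.getScheme()).append("://").append(uri.getRawAuthority());
        // Steps 3-6: split the raw path on '/', fix each segment, join again.
        out.append(Arrays.stream(uri.getRawPath().split("/", -1))
                .map(UrlPartFixer::fixPart)
                .collect(Collectors.joining("/")));
        // Step 7: same idea for the query, per key and per value.
        if (uri.getRawQuery() != null) {
            out.append('?').append(fixQuery(uri.getRawQuery()));
        }
        // The fragment is copied unchanged in this sketch.
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment());
        }
        return out.toString();
    }

    static String fixQuery(String rawQuery) {
        return Arrays.stream(rawQuery.split("&", -1))
                .map(pair -> {
                    int eq = pair.indexOf('=');
                    return eq < 0
                            ? fixPart(pair)
                            : fixPart(pair.substring(0, eq)) + "=" + fixPart(pair.substring(eq + 1));
                })
                .collect(Collectors.joining("&"));
    }

    // Decode one delimiter-free piece as ISO-8859-1, re-encode it as UTF-8.
    static String fixPart(String part) {
        String decoded = URLDecoder.decode(part, StandardCharsets.ISO_8859_1);
        return URLEncoder.encode(decoded, StandardCharsets.UTF_8).replace("+", "%20");
    }
}
```

Since the delimiters are consumed by the splitting and re-inserted by the joining, each piece handed to fixPart contains none of them, which is what makes the placeholder hashes unnecessary here.
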
baant
  • The big misunderstanding about URL encoding is that it is used to encode URLs. It's not. It's used to encode the query parameters and path elements of a URL. I strongly recommend that you split the URL into base URL, additional path elements and query parameters and process them separately. – Codo Mar 10 '20 at 13:28
  • A good rule of thumb is: any time you're writing some form of "encoding fix" by hand, you're probably doing something wrong. – Kayaman Mar 10 '20 at 13:43
  • Sorry to be possibly pedantic, but did you mean ISO-8859-1 (see https://en.wikipedia.org/wiki/ISO/IEC_8859-1) and not ISO-8858-1 (https://www.iso.org/standard/75885.html), which looks like an ISO standard for one particular type of laboratory test (i.e. nothing to do with character sets)? I've come across websites defining 8858 several times and it would appear to be a typo in each case. – Mark Bradley Jul 27 '20 at 10:49

1 Answer

If you can reliably recognize that a string is a URL, then, following @jschnasse's answer to a similar question on SO, this might be the solution you need:

URL url= new URL("http://some.url/xxx/a/%e4t%fcr%E4/b/%e4t%fcr%E4");
URI uri = new URI(url.getProtocol(), url.getUserInfo(), IDN.toASCII(url.getHost()), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
String correctEncodedURL=uri.toASCIIString(); 
System.out.println(correctEncodedURL);

outputs: http://some.url/xxx/a/%25e4t%25fcr%25E4/b/%25e4t%25fcr%25E4
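
In the output above the original escapes come back as %25e4 and so on, because the multi-argument URI constructors always quote the '%' character. A possible variation, which is not part of the answer and is only a sketch: percent-decode each component as ISO-8859-1 before handing it to the constructor, so that toASCIIString() emits UTF-8 escapes. The class name DecodeThenRebuild is hypothetical. Note the assumptions: an escaped delimiter such as %2F or %26 becomes a real delimiter after decoding, and a + is decoded as a space.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class DecodeThenRebuild {

    static String fix(String raw) {
        try {
            // new URL(String) mirrors the snippet above; it is deprecated in recent JDKs.
            URL url = new URL(raw);
            URI uri = new URI(
                    url.getProtocol(),
                    url.getUserInfo(),
                    IDN.toASCII(url.getHost()),
                    url.getPort(),
                    latin1(url.getPath()),
                    latin1(url.getQuery()),
                    latin1(url.getRef()));
            // Non-ASCII characters are written as UTF-8 percent-escapes here.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("not a usable URL: " + raw, e);
        }
    }

    // Percent-decodes a component as ISO-8859-1; absent components stay null.
    private static String latin1(String component) {
        return component == null
                ? null
                : URLDecoder.decode(component, StandardCharsets.ISO_8859_1);
    }
}
```

With the URL from the question, DecodeThenRebuild.fix(...) should produce the "good output" form with %C3%A4 and %C3%BC escapes. Plain phrases such as "t%E9l%E9phone" are not URLs and would still need separate handling.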

ptr92zet