Problem: sometimes we get links/phrases with an encoding that is invalid for us.
Examples and my first solution are below.
Description: I have to fix incorrectly encoded strings in one part of the application. Sometimes it is a word or phrase, but sometimes it is a URL. When it's a URL, I would like to change only the wrongly encoded characters. If I decode as ISO-8859-1 and encode as UTF-8, the URL's special characters (/ : ? = &) get encoded too. I coded a solution that works fine for my cases, but the hashes you will see below smell bad to me.
Have you had a similar problem, or do you know a library that can decode a phrase while leaving certain characters alone? Something like this:
decode(String value, char[] ignored)
I also thought about breaking the URL into pieces and fixing only the path and query, but parsing them would be even messier.
TL;DR: Decode an ISO-8859-1 encoded URL and encode it as UTF-8. Don't touch URL-specific characters (/ ? = : &).
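To make the wished-for decode(String value, char[] ignored) a bit more concrete, here is a minimal sketch of how such a helper might behave without placeholder hashes. It assumes Java 10+ (for the Charset overloads of URLDecoder/URLEncoder); the names SelectiveRecoder and recode are made up for illustration, and the ignored characters are passed as a String rather than a char[] only for brevity.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;

import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;

// Hypothetical helper, not part of any existing library.
final class SelectiveRecoder {

    // Re-encodes value from ISO-8859-1 escapes to UTF-8 escapes,
    // copying every character listed in `ignored` through verbatim.
    static String recode(String value, String ignored) {
        StringBuilder out = new StringBuilder();
        StringBuilder chunk = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (ignored.indexOf(c) >= 0) {
                // Flush the text collected so far, then keep the delimiter as-is.
                out.append(recodeChunk(chunk.toString())).append(c);
                chunk.setLength(0);
            } else {
                chunk.append(c);
            }
        }
        return out.append(recodeChunk(chunk.toString())).toString();
    }

    // Same decode/encode round trip as in the question, applied to one chunk.
    // Throws IllegalArgumentException on a malformed percent-escape.
    private static String recodeChunk(String chunk) {
        return URLEncoder.encode(URLDecoder.decode(chunk, ISO_8859_1), UTF_8);
    }
}
```

It would be called as SelectiveRecoder.recode(value, "/:?=&"). Note that URLEncoder/URLDecoder implement form encoding, so a "+" is decoded to a space and a space is encoded as "+"; whether that is acceptable depends on the inputs.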
Input/Output examples:
// wrong input
"http://some.url/xxx/a/%e4t%fcr%E4/b/%e4t%fcr%E4"
"t%E9l%E9phone"
// good output
"http://some.url/xxx/a/%C3%A4t%C3%BCr%C3%A4/b/%C3%A4t%C3%BCr%C3%A4"
"t%C3%A9l%C3%A9phone"
// very wrong output
"http%3A%2F%2Fsome.url%2Fxxx%2Fa%2F%C3%A4t%C3%BCr%C3%A4%2Fb%2F%C3%A4t%C3%BCr%C3%A4"
My first solution:
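import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.UUID;

import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;

class EncodingFixer {
    // Random placeholders; a UUID string contains only hex digits and '-',
    // so URLEncoder leaves it untouched.
    private static final String SLASH_HASH = UUID.randomUUID().toString();
    private static final String QUESTION_HASH = UUID.randomUUID().toString();
    private static final String EQUALS_HASH = UUID.randomUUID().toString();
    private static final String AND_HASH = UUID.randomUUID().toString();
    private static final String COLON_HASH = UUID.randomUUID().toString();

    String fix(String value) {
        // Null or whitespace-only input is returned unchanged.
        if (value == null || value.trim().isEmpty()) {
            return value;
        }
        return tryFix(value);
    }

    private String tryFix(String str) {
        try {
            String replaced = replaceWithHashes(str);
            // The Charset overloads of decode/encode require Java 10+.
            String fixed = URLEncoder.encode(URLDecoder.decode(replaced, ISO_8859_1), UTF_8);
            return replaceBack(fixed);
        } catch (IllegalArgumentException e) {
            // Malformed percent-escape: give up and return the input as-is.
            return str;
        }
    }

    // Hide the URL-specific characters behind placeholders before re-encoding.
    private String replaceWithHashes(String str) {
        return str
                .replace("/", SLASH_HASH)
                .replace("?", QUESTION_HASH)
                .replace("=", EQUALS_HASH)
                .replace("&", AND_HASH)
                .replace(":", COLON_HASH);
    }

    // Restore the URL-specific characters after re-encoding.
    private String replaceBack(String fixed) {
        return fixed
                .replace(SLASH_HASH, "/")
                .replace(QUESTION_HASH, "?")
                .replace(EQUALS_HASH, "=")
                .replace(AND_HASH, "&")
                .replace(COLON_HASH, ":");
    }
}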
Or should it be more like this?
- Check if the input is a URL
- Create a URL
- Get the path
- Split it by /
- Fix every part
- Put it back together
- Same for the query, but a little more complicated
I also thought about this, but it seems even messier than the replacements above :/
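For comparison, the URL-splitting steps above could be sketched roughly like this with java.net.URI. This is only an illustration under a few assumptions: Java 10+, hierarchical URLs with a scheme and authority (anything else is treated as a plain phrase), and the hypothetical names UrlPartFixer, fixEach and fixPart, which do not come from any library.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

import static java.nio.charset.StandardCharsets.ISO_8859_1;
import static java.nio.charset.StandardCharsets.UTF_8;

// Hypothetical helper, not an existing library class.
final class UrlPartFixer {

    static String fix(String value) {
        URI uri;
        try {
            uri = new URI(value);
        } catch (URISyntaxException e) {
            return fixPart(value); // not parseable as a URI: treat as a plain phrase
        }
        if (uri.getScheme() == null || uri.getRawAuthority() == null) {
            return fixPart(value); // no scheme/host: treat as a plain phrase
        }
        // Scheme and authority are copied as-is; only path and query are fixed.
        StringBuilder out = new StringBuilder()
                .append(uri.getScheme()).append("://").append(uri.getRawAuthority())
                .append(fixEach(uri.getRawPath(), "/", UrlPartFixer::fixPart));
        if (uri.getRawQuery() != null) {
            // Split into key=value pairs on '&', then each pair on '='.
            out.append('?').append(fixEach(uri.getRawQuery(), "&",
                    pair -> fixEach(pair, "=", UrlPartFixer::fixPart)));
        }
        if (uri.getRawFragment() != null) {
            out.append('#').append(uri.getRawFragment()); // fragment left untouched
        }
        return out.toString();
    }

    // Splits on the delimiter (keeping empty parts), fixes each part, joins again.
    private static String fixEach(String raw, String delimiter, UnaryOperator<String> fixer) {
        return Arrays.stream(raw.split(Pattern.quote(delimiter), -1))
                .map(fixer)
                .collect(Collectors.joining(delimiter));
    }

    // Decode as ISO-8859-1, encode as UTF-8 (form encoding: space <-> '+').
    // Throws IllegalArgumentException on a malformed percent-escape.
    private static String fixPart(String part) {
        return URLEncoder.encode(URLDecoder.decode(part, ISO_8859_1), UTF_8);
    }
}
```

The query handling does add a nested split, but no placeholders are needed and the scheme and host are never passed through the encoder.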