I have a string s that receives a value from a database. Depending on which server hosts the database, the value may or may not arrive UTF-8 encoded, and I can't control that.

My problem is that I need to encode the value of s with URLEncoder.encode(s, "UTF-8") only when that value is not already UTF-8; otherwise I get unwanted characters.

I can't use juniversalchardet to detect the encoding of the value. How should I approach this so that I encode only when needed and still get the correct value of the string?

joaopaulopaiva
  • `URLEncoder.encode(s, "UTF-8") when that value is not UTF-8`? is not or is? – ByeBye Aug 09 '17 at 18:38
  • If you're retrieving the value as a `String`, it's already stored internally in memory as `UTF-16`. Your explanation is wrong, and you're going for the wrong solution. You're probably getting a corrupted value from the database already. – Kayaman Aug 09 '17 at 18:40
  • @ByeBye is not. – joaopaulopaiva Aug 09 '17 at 18:40
  • @Kayaman I know that, but the value inside the string can contain something like this: `like%20this`. – joaopaulopaiva Aug 09 '17 at 18:42
  • Repost of https://stackoverflow.com/questions/6622226/check-if-a-string-is-valid-utf-8-encoded-in-java – Akash Aug 09 '17 at 18:42
  • Sure, but that has nothing to do with character encoding. If spaces are converted to `%20` it means the string is URL encoded. – Kayaman Aug 09 '17 at 18:43
  • @Kayaman Right. So how can I convert strings that have `%20` to normal ones and also guarantee that strings that do not have `%20` are not affected? – joaopaulopaiva Aug 09 '17 at 18:46
  • No guarantees, but if you have URL encoded Strings, you need to [URLDecode](https://docs.oracle.com/javase/8/docs/api/java/net/URLDecoder.html) them. – Kayaman Aug 09 '17 at 18:49
  • @joaopaulo.ps93 You **must** know a priori the form in which a string is encoded. If they hide that detail from you, that interface is not well defined. – Little Santi Aug 09 '17 at 18:51
  • Why is a database returning url-encoded strings? Are they stored as url-encoded in the record fields? If so, then you don't really need to re-encode them with `URLEncoder`, just use them as-is. Unless the original source did not use UTF-8 when encoding Unicode characters to byte octets when url-encoding, and you want to re-encode the url-encoding using UTF-8 octets. In which case, use `URLDecoder` to decode the url-encoding to a `String` using the original charset (if you don't know that, you are SOL), then use `URLEncoder` to url-encode that `String` using UTF-8. – Remy Lebeau Aug 09 '17 at 20:53
  • Thank you for your help! You were right, I was using `URLEncoder` when I should use `URLDecoder`. @Kayaman, can you kindly post your answer so I can accept it? – joaopaulopaiva Aug 10 '17 at 14:24
  • @joaopaulo.ps93 No problem. There you go. – Kayaman Aug 10 '17 at 17:31
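
The decode-then-re-encode route described in Remy Lebeau's comment above could look roughly like the sketch below. It assumes the original charset is known; ISO-8859-1 is used purely for illustration, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlReencodeExample {

    // Decodes a URL-encoded value using the charset it was originally
    // encoded with, then URL-encodes the resulting String as UTF-8.
    // The original charset must be known; it cannot be inferred reliably.
    public static String reencodeAsUtf8(String encoded, String originalCharset) {
        try {
            String decoded = URLDecoder.decode(encoded, originalCharset);
            return URLEncoder.encode(decoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Thrown when the JVM does not know the named charset.
            throw new IllegalArgumentException("Unsupported charset: " + originalCharset, e);
        }
    }

    public static void main(String[] args) {
        // %E9 is 'é' in ISO-8859-1 (assumed source charset for this example).
        System.out.println(reencodeAsUtf8("caf%E9", "ISO-8859-1"));
    }
}
```

Note that `URLEncoder` produces form encoding, so a space comes back as `+` rather than `%20`.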

1 Answer

When a String contains %20 (or, in general, %dd, where dd is a two-digit hex value from 00 to FF), it is URL-encoded. In a nutshell, URL encoding escapes "unsafe" characters that cannot safely be included as-is in URLs (and some other places). To reverse it, use URLDecoder.

As always when dealing with character conversions, you need to specify the encoding. UTF-8 is recommended, so unless you know you need something else, use that.
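
A minimal sketch of the decoding step, assuming the stored values were percent-encoded from UTF-8 bytes. The class and method names are hypothetical. Values without any escapes pass through unchanged, but as noted in the comments there are no guarantees: `URLDecoder` also turns `+` into a space, and a stray `%` that is not followed by two hex digits makes it throw `IllegalArgumentException`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlDecodeExample {

    // Decodes a URL-encoded value, interpreting the escaped bytes as UTF-8.
    public static String decodeUtf8(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeUtf8("like%20this")); // like this
        System.out.println(decodeUtf8("plain"));       // plain
    }
}
```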

Kayaman