Back Story
I retrieve strings from a database, alter some of the text in those strings, and then upload them back to the database, replacing the originals. After looking at the front-end that displays those strings, I noticed the character issues. I no longer have the original strings, but I do have the updated strings.
The Issue
These strings contain characters from other languages, and those characters no longer display correctly. I looked at the code-points, and it appears that each original character, which was one code-point, is now two different code-points.
"Je?ro^me" //code-points 8. Code-points: 74, 101, 63, 114, 111, 94, 109, 101
"Jéróme" //code-points 6. Code-points: 74, 233, 114, 243, 109, 101
The question
How do I get "Je?ro^me" back to "Jéróme"?
Things that I have tried
- Used Notepad++ to convert the encoding to or from UTF8, ANSI, and WINDOWS-1252.
- Created a Map that looks for things like e? and converts them to é.
Issues with the two attempts to solve the problem
a. The issue still existed after trying different conversions.
b. Two issues here:
- I don't know all of the potential sequences (e?, o^, etc.) to look for. There are over 20,000 files that may cover many languages.
- What if I have a sentence that legitimately ends in e? For example: How old are thee?
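The second concern can be made concrete with a small sketch. The class name NaiveReplacement and the helper applyMap are hypothetical names introduced only for illustration. They mirror the map-based attempt described above, not any library API. The accented characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NaiveReplacement {

    // Hypothetical helper: applies every "broken sequence -> character"
    // replacement in the map to the input, in insertion order.
    static String applyMap(String input, Map<String, String> replacements) {
        String result = input;
        for (Map.Entry<String, String> entry : replacements.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> replacements = new LinkedHashMap<>();
        replacements.put("e?", "\u00e9"); // e? -> é
        replacements.put("o^", "\u00f3"); // o^ -> ó

        // The damaged name is repaired as intended.
        System.out.println(applyMap("Je?ro^me", replacements));

        // An ordinary English question is damaged, because it also ends in "e?".
        System.out.println(applyMap("How old are thee?", replacements));
    }
}
```

The first call yields the repaired name, while the second turns the trailing "e?" of a perfectly valid sentence into "é". That is the false positive that makes a blind lookup table risky across 20,000 files.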
Things I researched to gain a better understanding of the issue
- What is a "surrogate pair" in Java?
- https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
- https://www.w3.org/International/questions/qa-what-is-encoding
- https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
MCVE
import java.util.HashMap;
import java.util.Map;

/**
 * https://stackoverflow.com/questions/5903008/what-is-a-surrogate-pair-in-java
 * https://docs.oracle.com/javase/tutorial/i18n/text/supplementaryChars.html
 * https://www.w3.org/International/questions/qa-what-is-encoding
 * https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
 * @author sedri
 */
public class App {

    // Static so that it can be reassigned from inside the lambda below.
    static String outputString;

    public static void main(String[] args) {
        // My approach to fix the issue:
        // use a map to replace each broken sequence with the correct character.
        // The output looks good, but I would need to include all special characters for many languages.
        // What if I have a sentence like: How old are thee?
        Map<String, String> map = new HashMap<>();
        map.put("e?", "é");
        map.put("o^", "ó");

        final String string = "Je?ro^me";
        final String accentString = "Jéróme";

        outputString = string;
        map.forEach((t, u) -> {
            if (outputString.contains(t)) {
                outputString = outputString.replace(t, u);
            }
        });
        System.out.println("Fixed output: " + outputString);
        System.out.println("");
        // End of my attempt at a solution.

        // Code points of the broken string.
        System.out.println("code points: " + string.codePoints().count());
        for (int i = 0; i < string.length(); i++) {
            System.out.println(string.charAt(i) + ": " + Character.codePointAt(string, i));
        }
        System.out.println("");

        // Code points of the string as it should look.
        System.out.println("code points: " + accentString.codePoints().count());
        for (int i = 0; i < accentString.length(); i++) {
            System.out.println(accentString.charAt(i) + ": " + Character.codePointAt(accentString, i));
        }
        System.out.println("");

        // Code points of the string after the map-based replacement.
        System.out.println("code points: " + outputString.codePoints().count());
        for (int i = 0; i < outputString.length(); i++) {
            System.out.println(outputString.charAt(i) + ": " + Character.codePointAt(outputString, i));
        }
        System.out.println("");
    }
}