I am developing a web application with Java and Tomcat 8. The application has a page for uploading a file whose content is then shown on a different page. Plain and simple.
However, these files might contain not-so-common characters as part of their text. Right now, I am working with a file that contains Vietnamese text, for example.
The file is encoded in UTF-8 and can be opened in any text editor. However, I couldn't find any way to upload it and keep the content in the correct encoding, despite searching a lot and trying many different things.
My page which uploads the file contains the following form:
<form method="POST" action="upload" enctype="multipart/form-data" accept-charset="UTF-8" >
File: <input type="file" name="file" id="file" multiple/><br/>
Param1: <input type="text" name="param1"/> <br/>
Param2: <input type="text" name="param2"/> <br/>
<input type="submit" value="Upload" name="upload" id="upload" />
</form>
It also contains:
<%@page contentType="text/html" pageEncoding="UTF-8"%>
...
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
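One thing I found while searching: Tomcat's `URIEncoding` connector attribute (in `conf/server.xml`) affects only the query string of GET requests, not the POST body, so it should not matter for this form, but I am noting it in case:

```xml
<!-- conf/server.xml: URIEncoding applies to the query string (GET) only;
     the POST body is decoded according to request.setCharacterEncoding(...) -->
<Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000"
           redirectPort="8443" URIEncoding="UTF-8"/>
```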
My servlet looks like this:
protected void processRequest(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    response.setContentType("text/html;charset=UTF-8");
    // Must be set before any parameter or part is read
    request.setCharacterEncoding("UTF-8");
    String param1 = request.getParameter("param1");
    String param2 = request.getParameter("param2");
    for (Part filePart : request.getParts()) {
        try (InputStream filecontent = filePart.getInputStream()) {
            String content = convertStreamToString(filecontent, "UTF-8");
            // Save the content and the parameters in the database
        }
    }
}
static String convertStreamToString(java.io.InputStream is, String encoding) {
    java.util.Scanner s = new java.util.Scanner(is, encoding).useDelimiter("\\A");
    return s.hasNext() ? s.next() : "";
}
Despite all my efforts, I have never been able to get that "content" string with the correct characters preserved. I either get something like "K?n" or "Káº¡n" (which seems to be the ISO-8859-1 interpretation of the UTF-8 bytes), when the correct text should be "Kạn".
To add to the problem, if I type Vietnamese characters into the other form parameters (param1 or param2), which also needs to work, I can only read them correctly if I set both the form's accept-charset and the servlet's scanner encoding to ISO-8859-1, which I definitely don't understand. In that case, if I print the received parameter I get something like "K & # 7 8 4 1 ; n" (without the spaces), which is a numeric character reference for the correct character.

So it seems to be possible to read the Vietnamese characters from the form using ISO-8859-1, as long as the form itself uses that charset. However, this never works for the content of the uploaded files. I even tried encoding the file itself in ISO-8859-1, so that the same charset would be used everywhere, but that does not work at all.
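If I understand the mojibake correctly, what happens is: the container decodes the UTF-8 bytes as ISO-8859-1, and since ISO-8859-1 maps every byte to a character one-to-one, the original bytes survive and can be re-decoded. A small standalone test of that assumption:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Kạn";
        // Simulate the container decoding the UTF-8 POST body as ISO-8859-1:
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1); // "Káº¡n"
        // Because ISO-8859-1 maps bytes 1:1 to characters, the bytes survive
        // the round trip and can be re-decoded as UTF-8:
        String repaired = new String(misread.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // true
    }
}
```

This re-decoding is of course a workaround; the clean fix should be calling `request.setCharacterEncoding("UTF-8")` before any parameter is read.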
I am sure this type of situation is not that rare, so I would like to ask some help from the people who might have been there before. I am probably missing something, so any help is appreciated.
Thank you in advance.
Edit 1: Although this question is yet to receive a reply, I will keep posting my findings, in case someone is interested or following it.
After trying many different things, I seem to have narrowed down the cause of the problem. I created a class which reads a file from a specific folder on disk and prints its content. The code goes:
public static void openFile() {
    System.out.println(String.format("file.encoding: %s", System.getProperty("file.encoding")));
    System.out.println(String.format("defaultCharset: %s", Charset.defaultCharset().name()));
    File file = new File(myFilePath);
    byte[] buffer = new byte[(int) file.length()];
    try (BufferedInputStream f = new BufferedInputStream(new FileInputStream(file))) {
        int read = f.read(buffer); // for a local file this fills the buffer
        String content = new String(buffer, 0, read, "UTF-8");
        System.out.println("UTF-8 File: " + content);
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}
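To rule out the console as the culprit, the check could also be done at the byte level, which does not depend on any charset. "ạ" (U+1EA1) should be the bytes E1 BA A1 in a genuinely UTF-8 file, so dumping the buffer in hex would show whether the bytes on disk are right before any decoding happens:

```java
import java.nio.charset.StandardCharsets;

public class ByteCheck {
    // Print bytes in hex so the check does not depend on console encoding
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // If the file really is UTF-8, these three bytes must appear
        // in the raw file contents wherever "ạ" occurs.
        byte[] sample = "ạ".getBytes(StandardCharsets.UTF_8);
        System.out.println(toHex(sample)); // E1 BA A1
    }
}
```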
Then I added a main function to this class, making it executable. When I run it standalone, I get the following output:
file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Kạn..."}
However, if I run the project as a webapp, as it is supposed to be run, and call the same function from that class, I get:
file.encoding: Cp1252
defaultCharset: windows-1252
UTF-8 File: {"...K?n..."}
Of course, this was clearly showing that the default encoding used by the webapp to read the file was not UTF-8. So I did some research on the subject and found the classical answer of creating a setenv.bat for Tomcat and having it execute:
set "JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8"
The result, however, is still not right:
file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Káº¡n..."}
I can see that the default encoding is now UTF-8. The content read from the file, however, is still wrong. The output above is the same I would get if I opened the file in Microsoft Word but chose to read it using ISO-Latin-1 instead of UTF-8. For some odd reason, reading the file still goes through ISO-Latin-1 somewhere, although everything points to UTF-8 being used.
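One thing I realized along the way: System.out encodes with the platform charset, so under Cp1252 a correctly decoded "ạ" is still printed as "?". That means the earlier "K?n" output may have been a console artifact rather than a decoding error. A standalone sketch of just the console side, using in-memory streams instead of the real console:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncoding {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // A PrintStream replaces characters its charset cannot encode with '?'.
        // Under windows-1252, "ạ" has no mapping, so "Kạn" comes out as "K?n",
        // even though the String in memory is perfectly correct.
        ByteArrayOutputStream cp1252Sink = new ByteArrayOutputStream();
        PrintStream cp1252 = new PrintStream(cp1252Sink, true, "windows-1252");
        cp1252.print("Kạn");
        System.out.println(cp1252Sink.toString("windows-1252")); // K?n

        // With a UTF-8 PrintStream, the same String survives intact.
        ByteArrayOutputStream utf8Sink = new ByteArrayOutputStream();
        PrintStream utf8 = new PrintStream(utf8Sink, true, "UTF-8");
        utf8.print("Kạn");
        System.out.println(utf8Sink.toString("UTF-8").equals("Kạn")); // true
    }
}
```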
Again, if anyone might have suggestions or directions for this, it will be highly appreciated.