
I am developing a web application with Java and Tomcat 8. This application has a page for uploading a file whose content will be shown on a different page. Plain and simple.

However, these files might contain not-so-common characters as part of their text. Right now, I am working with a file that contains Vietnamese text, for example.

The file is encoded in UTF-8 and can be opened in any text editor. However, I couldn't find any way to upload it and keep the content in the correct encoding, despite searching a lot and trying many different things.

My page which uploads the file contains the following form:

<form method="POST" action="upload" enctype="multipart/form-data" accept-charset="UTF-8">
    File: <input type="file" name="file" id="file" multiple/><br/>
    Param1: <input type="text" name="param1"/><br/>
    Param2: <input type="text" name="param2"/><br/>
    <input type="submit" value="Upload" name="upload" id="upload"/>
</form>

It also contains:

<%@page contentType="text/html" pageEncoding="UTF-8"%>
...
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

My servlet looks like this:

import java.io.IOException;
import java.io.InputStream;
import java.util.Collection;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.Part;

protected void processRequest(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {
    response.setContentType("text/html;charset=UTF-8");
    // Must run before the first getParameter()/getParts() call, otherwise
    // the container has already parsed the request with its default charset
    request.setCharacterEncoding("UTF-8");

    String param1 = request.getParameter("param1");
    String param2 = request.getParameter("param2");

    Collection<Part> parts = request.getParts();
    for (Part filePart : parts) {
        // try-with-resources closes the stream even if reading fails
        try (InputStream filecontent = filePart.getInputStream()) {
            String content = convertStreamToString(filecontent, "UTF-8");

            // Save the content and the parameters in the database
        }
    }
}

static String convertStreamToString(java.io.InputStream is, String encoding) {
    // "\\A" makes the Scanner consume the whole stream as a single token;
    // try-with-resources closes the Scanner (and the underlying stream)
    try (java.util.Scanner s = new java.util.Scanner(is, encoding).useDelimiter("\\A")) {
        return s.hasNext() ? s.next() : "";
    }
}
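
For what it's worth, request.setCharacterEncoding("UTF-8") only takes effect if it runs before the container parses the first parameter; a servlet filter is the usual way to guarantee that. A minimal sketch (the class name is my own, and it still has to be registered in web.xml or via @WebFilter):

import java.io.IOException;
import javax.servlet.*;

// Forces UTF-8 on every request before any servlet or JSP
// gets a chance to parse parameters with the platform default.
public class ForceUtf8Filter implements Filter {

    @Override
    public void init(FilterConfig filterConfig) throws ServletException {
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }

    @Override
    public void destroy() {
    }
}

Note that this only affects how text parameters are decoded; it has no effect on the bytes of an uploaded file.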

Despite all my efforts, I have never been able to get that "content" string with the correct characters preserved. I either get something like "K?n" or "Káº¡n" (which looks like the ISO-8859-1 interpretation of the UTF-8 bytes), when the correct result should be "Kạn".

To add to the problem: if I type Vietnamese characters into the other form parameters (param1 or param2), which also needs to work, I can only read them correctly if I set both the form's accept-charset and the servlet's Scanner encoding to ISO-8859-1, which I definitely don't understand. In that case, if I print the received parameter I get something like "K & # 7 8 4 1 ; n" (without the spaces), which is the HTML numeric character reference for the correct character. So it seems to be possible to read the Vietnamese characters from the form using ISO-8859-1, as long as the form itself uses that charset. However, this never works on the content of the uploaded files. I even tried encoding the file in ISO-8859-1, to use that charset for everything, but it does not work at all.
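
This behavior is consistent with the raw UTF-8 bytes surviving an ISO-8859-1 decode unharmed, since every byte maps to some Latin-1 character. A small standalone test (not part of the application) can demonstrate the round trip:

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String correct = "Kạn"; // "ạ" is U+1EA1, encoded in UTF-8 as E1 BA A1

        // Decoding UTF-8 bytes as ISO-8859-1 produces the classic mojibake:
        byte[] utf8Bytes = correct.getBytes(StandardCharsets.UTF_8);
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled); // "Káº¡n"

        // Since no byte was lost, re-encoding as ISO-8859-1 and decoding
        // as UTF-8 recovers the original string:
        String recovered = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
        System.out.println(recovered); // "Kạn"
    }
}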

I am sure this type of situation is not that rare, so I would like to ask for help from people who might have been there before. I am probably missing something, so any help is appreciated.

Thank you in advance.


Edit 1: Although this question is yet to receive a reply, I will keep posting my findings, in case someone is interested or following it.

After trying many different things, I seem to have narrowed down the cause of the problem. I created a class that reads a file from a specific folder on disk and prints its content. The code goes:

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public static void openFile() {
    System.out.println(String.format("file.encoding: %s", System.getProperty("file.encoding")));
    System.out.println(String.format("defaultCharset: %s", Charset.defaultCharset().name()));

    File file = new File(myFilePath);
    byte[] buffer = new byte[(int) file.length()];

    // try-with-resources closes the stream; the loop guards against short
    // reads, since a single read() is not guaranteed to fill the buffer
    try (BufferedInputStream f = new BufferedInputStream(new FileInputStream(file))) {
        int offset = 0;
        while (offset < buffer.length) {
            int read = f.read(buffer, offset, buffer.length - offset);
            if (read < 0) {
                break;
            }
            offset += read;
        }
        String content = new String(buffer, "UTF-8");
        System.out.println("UTF-8 File: " + content);
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}

Then I added a main function to this class, making it executable. When I run it standalone, I get the following output:

file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Kạn..."}

However, if I run the project as a webapp, as it is supposed to be run, and call the same function from that class, I get:

file.encoding: Cp1252
defaultCharset: windows-1252
UTF-8 File: {"...K?n..."}

Of course, this clearly showed that the default encoding used by the webapp to read the file was not UTF-8. So I did some research on the subject and found the classic answer of creating a setenv.bat for Tomcat and having it execute:

set "JAVA_OPTS=%JAVA_OPTS% -Dfile.encoding=UTF-8"

The result, however, is still not right:

file.encoding: UTF-8
defaultCharset: UTF-8
UTF-8 File: {"...Káº¡n..."}

I can see now that the default encoding became UTF-8. The content read from the file, however, is still wrong. The content shown above is the same I would get if I opened the file in Microsoft Word but chose to read it as ISO-Latin-1 instead of UTF-8. For some odd reason, something is still reading the file as ISO-Latin-1, although everything points to the use of UTF-8.
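
One check that might isolate the console from the file reading: print the same string through a PrintStream with an explicit encoding. If the output changes, the string in memory was fine all along and only the console encoding was mangling it (the method name is mine; whether the terminal can actually display the characters is a separate question):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public static void printUtf8(String content) throws UnsupportedEncodingException {
    // Wraps System.out with an explicit UTF-8 encoder instead of the
    // platform default picked up from file.encoding.
    PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
    utf8Out.println("UTF-8 File: " + content);
}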

Again, if anyone might have suggestions or directions for this, it will be highly appreciated.


1 Answer


I don't seem to be able to close the question, so let me contribute the answer I found.

The problem is that investigating this type of issue is very tricky, since there are many points in the code where the encoding might be changed (the page, the form encoding, the request encoding, file reading, file writing, console output, database writing, database reading...).

In my case, after doing everything that I posted in the question, I lost a lot of time trying to solve an issue that didn't exist any longer, just because the console output in my IDE (NetBeans, for that project) didn't use the desired character encoding. So I was doing everything right to a certain point, but when I tried to print anything I would get it wrong. After I started writing my logs to files, instead of the console, and thus controlling the writing encoding, I started to understand the issue clearly.
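
For anyone debugging the same way, a log writer with an explicitly chosen charset takes the console out of the equation entirely. A minimal sketch (the file name and method are my own):

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Appends a line to a log file as UTF-8 regardless of file.encoding,
// so what lands on disk reflects the string in memory, not the console.
public static void logUtf8(String message) throws IOException {
    try (BufferedWriter writer = Files.newBufferedWriter(
            Paths.get("debug-encoding.log"), StandardCharsets.UTF_8,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
        writer.write(message);
        writer.newLine();
    }
}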

What was missing in my solution, after everything I had already described in my question (before the edit), was to configure the encoding for the database connection. To my surprise, even though my database and all of my tables were using UTF-8, the communication between the application and MySQL was still in ISO-Latin-1. The last thing missing was adding "useUnicode=true&characterEncoding=utf-8" to the connection, just like this:

con = DriverManager.getConnection("jdbc:mysql:///dbname?useUnicode=true&characterEncoding=utf-8", "user", "pass");
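
To verify the fix end to end, a quick round trip through a PreparedStatement confirms that what comes back out of MySQL matches what went in. A sketch (the table and column names are hypothetical):

import java.sql.*;

// Assumes a table like: CREATE TABLE test_enc (txt VARCHAR(100)) CHARACTER SET utf8
public static void roundTrip(Connection con) throws SQLException {
    String original = "Kạn";
    try (PreparedStatement insert =
            con.prepareStatement("INSERT INTO test_enc (txt) VALUES (?)")) {
        insert.setString(1, original);
        insert.executeUpdate();
    }
    try (PreparedStatement select = con.prepareStatement("SELECT txt FROM test_enc");
         ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
            // Compare in code rather than printing, to keep the console
            // encoding out of the verdict.
            System.out.println(original.equals(rs.getString(1)) ? "match" : "MISMATCH");
        }
    }
}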

Thanks to this answer, amongst many others: https://stackoverflow.com/a/3275661/843668
