Good day.

I have a JSP/servlet application on Tomcat that saves multilingual text entered on a page to MySQL. The text is entered in a textarea on a JSP page. To save it, I call a Java servlet that reads the POST request parameters and copies them into the database. The Tomcat version is 7.0.63. When the servlet reads request parameters written in Russian or Chinese, question marks appear in place of the characters. I see them both in System.out.println output and in the MySQL table. The JSP page is encoded in UTF-8 (@page pageEncoding and meta http-equiv="Content-Type"), and the servlet request is also set to UTF-8 (setCharacterEncoding). The Tomcat Connector in server.xml has URIEncoding set to UTF-8. I have also added AddDefaultCharset UTF-8 to httpd.conf on the Apache HTTP Server. All other languages are encoded correctly.

How can I resolve the problem?

Best regards and good work.

Stefano Errani

  • Does this answer your question? [How to set request encoding in Tomcat?](https://stackoverflow.com/questions/6876697/how-to-set-request-encoding-in-tomcat) – Jozef Chocholacek Feb 26 '20 at 10:42
  • This question has been marked as a duplicate, but it is not. The referenced question is about URIEncoding in Tomcat for GET parameters. That's not the case here. – areus Feb 28 '20 at 14:55

2 Answers

Tomcat 7 implements the Servlet 3.0 and JSP 2.2 specs. In several places those specs make encoding relevant, and the default encoding they define is ISO-8859-1.

If you want the end user to be able to enter UTF-8 text in your webapp, and you want to receive it correctly and store it in a database, you have to take some steps.

The html page where the <form> resides must be encoded in UTF-8

If the page is generated by a servlet, you have to call response.setContentType("text/html; charset=UTF-8"); before calling getWriter. Or just: response.setCharacterEncoding("UTF-8");

The Servlet spec states:

If the servlet does not specify a character encoding before the getWriter method of the ServletResponse interface is called or the response is committed, the default ISO-8859-1 is used.

You can read section 5.4 of the spec for more information. For example, you can set an encoding based on the locale.
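
To illustrate why the charset has to be chosen before the writer exists, here is a minimal JDK-only sketch. It is not servlet code: a ByteArrayOutputStream stands in for the response body, and the class name WriterCharsetDemo and the sample word are made up for the example. A writer's charset is fixed when it is created, so text written through an ISO-8859-1 writer loses every character that charset cannot represent.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: the byte stream stands in for the HTTP response body.
final class WriterCharsetDemo {

    // Writes text through a PrintWriter bound to the given charset and returns the raw bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        PrintWriter writer = new PrintWriter(new OutputStreamWriter(body, charset));
        writer.print(text);
        writer.flush();
        return body.toByteArray();
    }

    public static void main(String[] args) {
        // A six-letter Russian word, written with escapes so the source file stays ASCII.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] utf8 = write(russian, StandardCharsets.UTF_8);
        byte[] latin1 = write(russian, StandardCharsets.ISO_8859_1);
        System.out.println("UTF-8 bytes: " + utf8.length);
        System.out.println("ISO-8859-1 bytes: " + latin1.length
                + ", first byte as char: " + (char) latin1[0]);
    }
}
```

With UTF-8 the six Cyrillic letters take two bytes each. With ISO-8859-1 each letter is replaced by a single ? byte, which is the same symptom described in the question.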

If the HTML is generated by a JSP page, the rules for the response character encoding are defined in section 4.2 of the JSP spec:

The initial response character encoding is set to the CHARSET value of the contentType attribute of the page directive. If the page doesn’t provide this attribute or the attribute doesn’t have a CHARSET value, the initial response character encoding is determined as follows:

• For documents in XML syntax, it is UTF-8.

• For JSP pages in standard syntax, it is the character encoding specified by the BOM, by the pageEncoding attribute of the page directive, or by a JSP configuration element page-encoding whose URL pattern matches the page. Only the character encoding specified for the requested page is used; the encodings of files included via the include directive are not taken into consideration. If there’s no such specification, no initial response character encoding is passed to ServletResponse.setContentType() - the ServletResponse object’s default, ISO-8859-1, is used.

So you can set it by including

<%@ page contentType="text/html; charset=UTF-8" %>

in the JSP page that generates the form. Note that pageEncoding is also necessary if your JSP page contains UTF-8 encoded characters in verbatim text.

A convenient way to set common attributes for all the pages in your web app is to use a jsp-property-group, by including this configuration in your web.xml:

<jsp-config>
    <jsp-property-group>
        <description>Apply to all JSPs</description>
        <url-pattern>*.jsp</url-pattern>
        <page-encoding>UTF-8</page-encoding>
        <default-content-type>text/html; charset=UTF-8</default-content-type>
    </jsp-property-group>
</jsp-config>

The request submitted must be read in UTF-8

In section 3.10, the Servlet spec states:

Currently, many browsers do not send a char encoding qualifier with the Content-Type header, leaving open the determination of the character encoding for reading HTTP requests. The default encoding of a request the container uses to create the request reader and parse POST data must be “ISO-8859-1” if none has been specified by the client request. However, in order to indicate to the developer, in this case, the failure of the client to send a character encoding, the container returns null from the getCharacterEncoding method.

If the client hasn’t set character encoding and the request data is encoded with a different encoding than the default as described above, breakage can occur. To remedy this situation, a new method setCharacterEncoding(String enc) has been added to the ServletRequest interface. Developers can override the character encoding supplied by the container by calling this method. It must be called prior to parsing any post data or reading any input from the request. Calling this method once data has been read will not affect the encoding.

So you need to make sure request.setCharacterEncoding("UTF-8") is called before anything accesses the request content.
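
As a rough illustration of what the ISO-8859-1 default does to a UTF-8 form post, the following JDK-only sketch decodes the same percent-encoded value with both charsets. The class name FormDecodingDemo and the sample value (a six-letter Russian word) are assumptions for the example; a real container does this decoding internally when you call getParameter.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the Servlet API.
final class FormDecodingDemo {

    // Decodes one percent-encoded form value with the given charset.
    static String decode(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Cannot happen for a Charset instance the JVM already provides.
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The bytes a browser sends for the six-letter word when the form page is UTF-8.
        String posted = "%D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82";
        String asUtf8 = decode(posted, StandardCharsets.UTF_8);
        String asLatin1 = decode(posted, StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.length());   // 6 characters: the original word
        System.out.println(asLatin1.length()); // 12 characters: one per byte, garbled
    }
}
```

Decoding with the wrong charset turns every UTF-8 byte into its own character, so the parameter is already damaged before your servlet code sees it.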

The best way is to implement a filter to set the character encoding if it hasn't been already set:

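// Goes in a class implementing javax.servlet.Filter; needs java.io.IOException imported.
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
    // Only set the encoding when the client did not declare one.
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding("UTF-8");
    }
    // Continue with the remaining filters and the target servlet.
    chain.doFilter(request, response);
}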

And declare the filter at the beginning of your web.xml (yes, the order is important) with something like this:

<filter>
    <filter-name>Character Encoding Filter</filter-name>
    <filter-class>yourpackage.YourCharacterEncodingFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>Character Encoding Filter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>

This way the filter is applied to every request before any other filter, so we can be sure that the request data hasn't been accessed yet.

You can replace the <url-pattern> element with a <servlet-name> element to apply the filter to only one servlet.

Note that this only applies to POST requests. For GET requests, Tomcat 7 uses ISO-8859-1 by default to decode %-encoded URI bytes. This can be overridden by adding the URIEncoding attribute to the <Connector> element in the server.xml file, as stated in the Tomcat 7 docs. https://tomcat.apache.org/tomcat-7.0-doc/config/http.html#Common_Attributes

How do I know the parameters are read correctly?

The best way to ensure your webapp is reading the parameters correctly is to write a response from the servlet, encoded in UTF-8, and see how the parameters are printed in your browser.

You could do something like this in your servlet:

response.setContentType("text/html; charset=UTF-8");

PrintWriter writer = response.getWriter();
writer.println("<html><body>");
writer.println("UTF-8 encoded parameter: " + request.getParameter("yourparam"));
writer.println("</body></html>");

You cannot rely on text printed to the console with System.out.println because, for example, the default encoding of the Windows console is CP1252, which is nearly the same as ISO-8859-1.

So if you print characters to the console that CP1252 does not support, you will see gibberish or question marks. (To change the console encoding on Windows, see for example: https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8)
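
If you still want to log the parameter, one option (a sketch, not something from the Servlet API) is to print its characters as ASCII-only Unicode escapes, which survive any console encoding. The helper names toEscapes and fitsLatin1 are hypothetical. The second helper also shows why Spanish text can look fine while Russian does not: the former fits in ISO-8859-1, the latter does not.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for inspecting a parameter without trusting the console.
final class ParamInspector {

    // Renders every char as a backslash-u escape so the output is plain ASCII.
    static String toEscapes(String value) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            out.append(String.format("\\u%04x", (int) value.charAt(i)));
        }
        return out.toString();
    }

    // True if ISO-8859-1 can represent the whole string.
    static boolean fitsLatin1(String value) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(value);
    }

    public static void main(String[] args) {
        String spanish = "a\u00f1o";              // Spanish word containing n with tilde
        String russian = "\u043c\u0438\u0440";    // three Cyrillic letters
        System.out.println(toEscapes(spanish) + " fitsLatin1=" + fitsLatin1(spanish));
        System.out.println(toEscapes(russian) + " fitsLatin1=" + fitsLatin1(russian));
    }
}
```

If the escapes logged for a Russian parameter are in the 04xx range, the servlet read it correctly and the problem is further down the line. If they are all 003f, the question marks were already in the string.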

The Webapp reads and writes UTF-8 text, but it's not stored on the database

If all of the above works but you still can't store UTF-8 in your database, it must be an issue with the configuration of your database.

MySQL 8.0 seems to use UTF-8 by default, but earlier versions such as 5.7 use Latin1 (= ISO-8859-1) by default, and special steps need to be taken to work with UTF-8. See: https://dev.mysql.com/doc/refman/5.7/en/charset-applications.html

Also, be sure to use the latest available JDBC drivers compatible with your database version.
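
For the JDBC side, here is a minimal sketch assuming MySQL Connector/J: the useUnicode and characterEncoding URL properties ask the driver to exchange text as UTF-8. The host, port, database name and the class name Utf8Jdbc are placeholders, and open() only works with the driver on the classpath and a reachable server.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper class; adjust host, port and database to your setup.
final class Utf8Jdbc {

    // Builds a Connector/J URL asking the driver to exchange text as UTF-8.
    static String url(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    // Opens a connection; requires the MySQL driver on the classpath and a running server.
    static Connection open(String host, int port, String database, String user, String password)
            throws SQLException {
        return DriverManager.getConnection(url(host, port, database), user, password);
    }

    public static void main(String[] args) {
        // Only prints the URL, so it runs without a database.
        System.out.println(url("localhost", 3306, "db_name"));
    }
}
```

The URL setting only covers the connection. The server, database, table and column character sets still have to be UTF-8 as well.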

areus
  • Ok for the System.out.println. – stefano errani Feb 28 '20 at 19:06
  • To set the encoding of the request and response in the servlet, in the filter, and in the JSP page, I have set everything as you indicated. The same goes for the database, the tables, and the JDBC connection. I have also written the configuration as you indicated in the web.xml of the application and of Tomcat (global). – stefano errani Feb 28 '20 at 19:07
  • My problem occurs only in the servlet, not in the JSP page, where text in all languages is displayed correctly. – stefano errani Feb 28 '20 at 19:07
  • I also tried writing Russian and Chinese text to a txt file on the server using a JSP bean, but the characters are written incorrectly there too. – stefano errani Feb 28 '20 at 19:07
  • What I understand is that all other languages are encoded correctly (for example, the tilde characters in Spanish), but not Chinese and Russian. – stefano errani Feb 28 '20 at 19:08
  • Characters in spanish are supported by iso-8859-1. My bet is that your database stores text in Latin1/ISO-8859-1 and not UTF-8 – areus Feb 28 '20 at 19:32
  • Can you execute the following on your database?: `USE db_name; SELECT @@character_set_database, @@collation_database;` – areus Feb 28 '20 at 20:15
  • Actually, the tables and fields were utf8/utf8_general_ci and the database was latin1/latin1_swedish_ci. I have now set the database to utf8/utf8_general_ci. When I try saving Russian and Chinese text to the database from the servlet, nothing has changed. – stefano errani Feb 29 '20 at 09:49
  • Check the characterEncoding parameter on your JDBC connection URL. See: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-charsets.html – areus Feb 29 '20 at 14:15
  • The JDBC connection characterEncoding is set to UTF-8. – stefano errani Feb 29 '20 at 16:58
  • It's an X File. Running out of ideas. This? https://stackoverflow.com/a/33669691 – areus Feb 29 '20 at 17:43
  • Are you doing something inappropriate in your code, like calling `getBytes()` on a String? – areus Feb 29 '20 at 17:45
  • Yeahhh!!!! I found the solution!!! In my.ini, in recent days I had set default-character-set to utf8, but not character-set-server. Best regards and good work. Stefano Errani – stefano errani Mar 02 '20 at 15:09
  • It seems that some MySQL versions are tricky to set up for UTF-8. Happy you found it. – areus Mar 02 '20 at 15:46

Good day.

Two questions: 1) Must the doFilter method be added to the servlet in which I read the request parameters in Russian and Chinese? 2) In web.xml, must the filter class be that same servlet? I have to apply the same encoding to other servlets, so do I have to list in web.xml all the servlets to which the filter applies?

Best regards and good work.

Stefano Errani

  • 1. The doFilter method must be in a class implementing the javax.servlet.Filter interface. 2. The order is important for other filters, not for servlets. There must be a filter-mapping element (I added it to the answer); with it you can map the filter to a url-pattern, or to a servlet-name so the filter is only applied to that servlet. You can see examples using filters here: https://www.codejava.net/java-ee/servlet/how-to-create-java-servlet-filter – areus Feb 27 '20 at 10:49
  • P.S.: please note that discussions about an answer should take place in the comments section of the answer. – areus Feb 27 '20 at 10:51
  • Following your instructions and those in the linked article, I created the class that implements Filter and declared it in web.xml, mapping it to the servlets I need instead of a url-pattern. I put the filter element and its filter-mapping at the beginning of the file. However, although I can see (via System.out.println) that the filter class runs before the servlet, the request parameters with Russian and Chinese text still have the Cyrillic and Chinese characters replaced with question marks. Any other ideas? Best regards. Stefano Errani – stefano errani Feb 27 '20 at 19:01
  • Check two things. 1. Is the method used in your `form` POST or GET? 2. Check the "page info" in your browser when viewing the form. – areus Feb 27 '20 at 19:15
  • The form method is POST; there are too many parameters for GET. Which info must I check? Best regards. Stefano Errani. – stefano errani Feb 27 '20 at 21:49
  • Usually in the browser you can view the page info to make sure that the content is really UTF-8. But I think we are looking at the wrong issue. We are assuming that the servlet is not reading the characters correctly because strange characters appear with System.out.println and because they are not stored properly in the database. But console output usually doesn't support UTF-8 (on Windows it's CP1252 by default, nearly the same as ISO-8859-1), and maybe MySQL is not configured to store UTF-8; most installations use ISO-8859-1 encoding by default. – areus Feb 28 '20 at 11:32
  • In my browser the Russian and Chinese text is displayed correctly (for example, when I click a button in the application, a new JSP page opens to which I pass the Russian or Chinese text via GET, and it is displayed correctly). I understand that on the Windows console System.out.println does not print the string in UTF-8, and that's OK. The servlet is configured as you indicated, with the filter class. I'm sure that the database and the related tables and fields are utf8/utf8_general_ci. The same goes for the JDBC connection, which is encoded in UTF-8. – stefano errani Mar 01 '20 at 21:27
  • It's late in Italy now and I'm a little tired, but tomorrow I would like to verify the whole configuration again as you indicated in the previous comments. – stefano errani Mar 01 '20 at 21:27