Tomcat 7 implements the Servlet 3.0 and JSP 2.2 specs. Encoding is relevant in several places in those specs, and the default encoding they define is ISO-8859-1.
If you want end users to be able to enter UTF-8 text in your webapp, and you want to receive it correctly so that you can store it in a database, you have to take a few steps.
The HTML page where the <form> resides must be encoded in UTF-8
If the page is generated by a servlet, then before calling getWriter you have to call
response.setContentType("text/html; charset=UTF-8");
Or just:
response.setCharacterEncoding("UTF-8");
The Servlet spec states that:
If the servlet does not specify a character encoding before the
getWriter method of the ServletResponse interface is called or the
response is committed, the default ISO-8859-1 is used.
See section 5.4 of the spec for more details. For example, you can also set the encoding based on the locale.
If the HTML is generated by a JSP page, the rules for the response character encoding are defined in section 4.2 of the JSP spec:
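To make the effect of that default concrete, here is a minimal sketch that models the response writer with plain JDK classes (no servlet API involved). ResponseWriterDemo and its write method are hypothetical names, and OutputStreamWriter only stands in for whatever writer the container really creates:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: the same text written through writers with different charsets. */
final class ResponseWriterDemo {

    /** Writes text into an in-memory "response body" using the given charset. */
    static byte[] write(String text, Charset responseCharset) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(body, responseCharset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
        return body.toByteArray();
    }

    public static void main(String[] args) {
        String euro = "\u20ac"; // the euro sign, which ISO-8859-1 cannot represent
        System.out.println(write(euro, StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(write(euro, StandardCharsets.UTF_8).length);      // 3
    }
}
```

With ISO-8859-1 the euro sign is replaced by a single '?' byte, while the UTF-8 writer emits the three bytes E2 82 AC. This is why the encoding has to be set before getWriter is called.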
The initial response character encoding is set to the CHARSET value of
the contentType attribute of the page directive. If the page doesn’t
provide this attribute or the attribute doesn’t have a CHARSET value,
the initial response character encoding is determined as follows:
• For documents in XML syntax, it is UTF-8.
• For JSP pages in standard
syntax, it is the character encoding specified by the BOM, by the
pageEncoding attribute of the page directive, or by a JSP
configuration element page-encoding whose URL pattern matches the
page. Only the character encoding specified for the requested page is
used; the encodings of files included via the include directive are
not taken into consideration. If there’s no such specification, no
initial response character encoding is passed to
ServletResponse.setContentType() - the ServletResponse object’s
default, ISO-8859-1, is used.
So you can set it by including
<%@ page contentType="text/html; charset=UTF-8" %>
in the JSP page that generates the form. Note that pageEncoding is also necessary if the JSP page contains UTF-8 encoded characters in verbatim text.
A convenient way to set common attributes for all the pages in your web app is to use a jsp-property-group, by including this config in your web.xml:
<jsp-config>
    <jsp-property-group>
        <description>Apply to all JSPs</description>
        <url-pattern>*.jsp</url-pattern>
        <page-encoding>UTF-8</page-encoding>
        <default-content-type>text/html; charset=UTF-8</default-content-type>
    </jsp-property-group>
</jsp-config>
The request submitted must be read in UTF-8
In section 3.10, the Servlet spec states that:
Currently, many browsers do not send a char encoding qualifier with
the Content-Type header, leaving open the determination of the
character encoding for reading HTTP requests. The default encoding of
a request the container uses to create the request reader and parse
POST data must be “ISO-8859-1” if none has been specified by the
client request. However, in order to indicate to the developer, in
this case, the failure of the client to send a character encoding, the
container returns null from the getCharacterEncoding method.
If the client hasn’t set character encoding and the request data is
encoded with a different encoding than the default as described above,
breakage can occur. To remedy this situation, a new method
setCharacterEncoding(String enc) has been added to the ServletRequest
interface. Developers can override the character encoding supplied by
the container by calling this method. It must be called prior to
parsing any post data or reading any input from the request. Calling
this method once data has been read will not affect the encoding.
So you need to make sure that request.setCharacterEncoding("UTF-8")
is called before the request content is accessed in any way.
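The kind of breakage the spec describes can be reproduced without a container. The sketch below assumes the browser submitted the form as UTF-8 and compares what comes out when those bytes are decoded with each charset. RequestDecodingDemo and readParameter are made-up names for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: UTF-8 form data decoded with the wrong and the right charset. */
final class RequestDecodingDemo {

    /** Simulates a container decoding UTF-8 bytes from the wire with the given charset. */
    static String readParameter(String typedByUser, Charset containerCharset) {
        byte[] wire = typedByUser.getBytes(StandardCharsets.UTF_8); // assumption: browser sends UTF-8
        return new String(wire, containerCharset);
    }

    public static void main(String[] args) {
        String typed = "caf\u00e9"; // "cafe" with an acute accent on the e
        String wrong = readParameter(typed, StandardCharsets.ISO_8859_1);
        String right = readParameter(typed, StandardCharsets.UTF_8);
        // The two UTF-8 bytes of the accented e turn into two separate characters.
        System.out.println(wrong.equals("caf\u00c3\u00a9")); // true
        System.out.println(right.equals(typed));              // true
    }
}
```

Decoded with the ISO-8859-1 default, the accented letter arrives as two characters (U+00C3 U+00A9). Decoded as UTF-8, the value is identical to what the user typed.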
The best way is to implement a filter that sets the character encoding if it hasn't already been set:
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
    // A null encoding means the client did not send a charset, so default to UTF-8.
    if (request.getCharacterEncoding() == null) {
        request.setCharacterEncoding("UTF-8");
    }
    chain.doFilter(request, response);
}
And declare the filter at the beginning of your web.xml (yes, the order is important) with something like this:
<filter>
    <filter-name>Character Encoding Filter</filter-name>
    <filter-class>yourpackage.YourCharacterEncodingFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>Character Encoding Filter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
This way the filter is applied to all requests before any other filter, so we can be sure that the request data hasn't been accessed yet.
You can replace the <url-pattern>
element with a <servlet-name>
element to apply the filter to a single servlet only.
Note that this only applies to POST requests. For GET requests, Tomcat 7 uses ISO-8859-1 by default to decode percent-encoded URI bytes. This can be overridden by adding the URIEncoding attribute (for example URIEncoding="UTF-8") to the <Connector>
element in the server.xml
file, as stated in the Tomcat 7 docs: https://tomcat.apache.org/tomcat-7.0-doc/config/http.html#Common_Attributes
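The same mismatch can be seen for query strings with java.net.URLDecoder, which is used here only as a stand-in for the connector's URI decoding (Tomcat does not necessarily call it internally). QueryDecodingDemo is a hypothetical name:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

/** Hypothetical demo: a percent-encoded query value decoded with two charsets. */
final class QueryDecodingDemo {

    /** Decodes a percent-encoded value, interpreting the bytes in the named charset. */
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "caf%C3%A9"; // accented "cafe", percent-encoded from UTF-8 bytes
        System.out.println(decode(raw, "ISO-8859-1").length()); // 5
        System.out.println(decode(raw, "UTF-8").length());      // 4
    }
}
```

Decoding the two bytes C3 A9 as ISO-8859-1 yields two characters instead of one, which is what a GET parameter looks like when the connector decodes the URI with the wrong charset.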
How do I know the parameters are read correctly?
The best way to ensure your webapp is reading the parameters correctly is to write a response from the servlet, encoded in UTF-8, and see how the parameters are printed in your browser.
You could do something like this in your servlet:
response.setContentType("text/html; charset=UTF-8");
PrintWriter writer = response.getWriter();
writer.println("<html><body>");
writer.println("UTF-8 encoded parameter: " + request.getParameter("yourparam"));
writer.println("</body></html>");
You cannot rely on text printed with System.out.println
to the console, because, for example, on Windows the console typically uses a legacy code page such as CP1252, which is nearly the same as ISO-8859-1.
So if you print UTF-8 characters that CP1252 does not support, you will see gibberish or question marks on your console. (To change the encoding of the console on Windows see, for example: https://superuser.com/questions/269818/change-default-code-page-of-windows-console-to-utf-8)
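If you still want to log a parameter on the server side, one option is to print its code points in plain ASCII, which survives any console encoding. This is just a sketch, and CodePointDump is a hypothetical helper:

```java
/** Hypothetical helper: renders a string as ASCII-only code points for logging. */
final class CodePointDump {

    /** Returns the code points of text in the form "U+0063 U+0061 ...". */
    static String dump(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", codePoint));
            i += Character.charCount(codePoint); // skip both halves of a surrogate pair
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump("caf\u00e9"));       // U+0063 U+0061 U+0066 U+00E9
        System.out.println(dump("caf\u00c3\u00a9")); // U+0063 U+0061 U+0066 U+00C3 U+00A9
    }
}
```

A correctly decoded value ends in U+00E9 here, while a value that was decoded as ISO-8859-1 ends in U+00C3 U+00A9, so the difference is visible even on a console that cannot display the characters themselves.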
The Webapp reads and writes UTF-8 text, but it's not stored on the database
If all of the above works but you still can't store UTF-8 text in your database, it is most likely an issue with the configuration of your database.
MySQL 8.0 uses a UTF-8 character set (utf8mb4) by default, but earlier versions such as 5.7 use latin1 (roughly ISO-8859-1) by default, and special steps need to be taken to work with UTF-8. See: https://dev.mysql.com/doc/refman/5.7/en/charset-applications.html
Also, be sure to use the latest available JDBC drivers compatible with your database version.