
I'm sending the characters مرحبا from a JSP to a servlet, and in the servlet I receive them in this format: ÙØ±Ø­Ø¨Ø§. I want to know which component is converting them and which encoding it is using.

If I pass these characters using the POST method, I receive the data as is.

I'm using JDK 1.6 and Tomcat 7.

This is the JSP.

<%@page language="java" pageEncoding="UTF-8" contentType="text/html; charset=UTF-8"%>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Insert title here</title>
</head>
<body>
<form method="get" action="register">
    Name:<input type="text" name="userName"/><br/><br/>
    <input type="submit" value="SUBMIT"/>
</form>
</body>
</html>

This is the servlet.

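import java.io.IOException;
import javax.servlet.RequestDispatcher;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class Register extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
        System.out.println("System: file.encoding=" + System.getProperty("file.encoding"));
        String str = request.getParameter("userName").trim();
        RequestDispatcher rd = request.getRequestDispatcher("Display.jsp");
        System.out.println("before encoding and decoding the string : " + str);
        request.setAttribute("beforeconvert", str);
        rd.forward(request, response); // assumed: the posted snippet stops before the forward
    }
}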

1 Answer


It is more or less a UTF-8 encoded string that has been decoded as Latin1 (ISO-8859-1).

Original string "مرحبا" is composed of characters having the following Unicode code points: '0x645', '0x631', '0x62d', '0x628', '0x627'

When encoded as UTF-8, it gives: '\xd9\x85\xd8\xb1\xd8\xad\xd8\xa8\xd8\xa7', which when interpreted as Latin1 gives: "ÙØ±Ø­Ø¨Ø§". The character '\x85' is a control character in Latin1 and is normally not printable, so it is not visible in the string you gave in your question. Without it, the string as displayed can no longer be decoded back as UTF-8.
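The byte sequence above can be reproduced outside the container. The following is only a minimal sketch: the class name `MojibakeDemo` and its helper methods are made up for illustration, and the Arabic text is written with `\u` escapes so the result does not depend on the source file encoding. It uses `Charset.forName` so that it also compiles on JDK 1.6.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original question or answer.
public class MojibakeDemo {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    /** Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1. */
    public static String misdecode(String text) {
        return new String(text.getBytes(UTF8), LATIN1);
    }

    /** Formats the UTF-8 bytes of the text as lowercase hex pairs separated by spaces. */
    public static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(UTF8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u0645\u0631\u062d\u0628\u0627";
        // Two UTF-8 bytes per Arabic letter.
        System.out.println(utf8Hex(original));
        // One Latin1 character per byte, so the misdecoded string is twice as long.
        System.out.println(misdecode(original).length());
    }
}
```

The second character of the misdecoded string is U+0085, the control character that does not show up when the string is printed or pasted.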

For the POST method, which you say already works, you can declare in the form that the data will be UTF-8 encoded. Normally <form accept-charset="UTF-8" ...> should be enough.

In a GET request, there is no way to specify any encoding. You must decide how you want to interpret the parameters. You have two ways to do that:

  • explicitly at application level:

    Charset u8 = Charset.forName("UTF-8");
    Charset l1 = Charset.forName("ISO-8859-1");
    String utf8String = u8.decode(l1.encode(str)).toString();
    
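  • or ask the servlet container to do it for you. For Tomcat, you can set the URIEncoding attribute on the <Connector> element in server.xml to the expected charset: URIEncoding="UTF-8". When the attribute is not set, Tomcat 7 decodes the URI as ISO-8859-1, which is the conversion you are seeing.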
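The application-level option can be tried without a running container. This is a sketch under stated assumptions: `GetParamRecoder` and `recode` are hypothetical names, and the "received" value is simulated by decoding UTF-8 bytes as ISO-8859-1, which is what the question describes for the GET parameter.

```java
import java.nio.charset.Charset;

// Hypothetical helper class wrapping the three-line conversion from the answer.
public class GetParamRecoder {
    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final Charset LATIN1 = Charset.forName("ISO-8859-1");

    /** Re-reads as UTF-8 a parameter that the container decoded as ISO-8859-1. */
    public static String recode(String misdecoded) {
        if (misdecoded == null) {
            return null; // parameter absent from the request
        }
        // Latin1 maps each char back to the original byte; those bytes are then read as UTF-8.
        return UTF8.decode(LATIN1.encode(misdecoded)).toString();
    }

    public static void main(String[] args) {
        String original = "\u0645\u0631\u062d\u0628\u0627";
        // Simulate what the servlet receives: UTF-8 bytes decoded as ISO-8859-1.
        String received = new String(original.getBytes(UTF8), LATIN1);
        System.out.println(recode(received).equals(original)); // prints true
    }
}
```

In the servlet this would be called as `recode(request.getParameter("userName"))`. Use only one of the two options: if the container already decodes the URI as UTF-8, the parameter contains real Arabic characters, which ISO-8859-1 cannot encode, and the conversion replaces each of them with `?`.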

Serge Ballesta
  • Why does it get converted like that for a GET request? – Ganesh Bhagavath Dec 20 '16 at 09:51
  • And how can I get UTF-8 characters in Tomcat 7 without adding URIEncoding in server.xml? – Ganesh Bhagavath Dec 20 '16 at 10:09
  • For a GET request. – Ganesh Bhagavath Dec 20 '16 at 10:12
  • That helps a lot, Thanks Serge Ballesta – Ganesh Bhagavath Dec 20 '16 at 11:14
  • Hi Serge Ballesta... I'm passing the characters مرحبا from a JSP, and in the servlet I capture the value with getParameter and print it. It arrives in this format: '&#1605;&#1585;&#1581;&#1576;&#1575;'. Now I want to convert it into a UTF-16 string and store it in the DB. And when I retrieve it from the DB, I want to convert it back to that HTML code. I'm using JSP with page encoding ISO-8859-1, a servlet, Tomcat 7, and it is a GET request. Can you help me? – Ganesh Bhagavath Dec 21 '16 at 10:25
  • @GaneshBhagavath: search SO for [java] html entities for more info, but at least those 2 questions should help http://stackoverflow.com/q/994331/3545273 and http://stackoverflow.com/q/10978098/3545273 – Serge Ballesta Dec 21 '16 at 11:12