0

I have a problem: when end users submit data through an HTML form in a web application, they paste it from a Word document, and the text contains a long dash or em dash.

Our logic then reads this data from the database and writes it to an Excel file.

As a result, those characters show up in the Excel file as a kind of question mark, as shown below.

  Actual output : 1993 � 1995
Expected output : 1993 – 1995 

I have applied UTF-8 encoding in Java but still get the same output in the Excel file. How can I solve this?

Below is the extract of my code.

try {
    keyStrenghts = new String(keyStrenghts.getBytes("utf-8"));
} catch (UnsupportedEncodingException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}

I am using JDK 6 and Apache POI to generate the Excel file.

prabu

2 Answers

1

This might solve your problem if it is limited to em dashes:

keyStrenghts = keyStrenghts.replaceAll("\\p{Pd}", "-");

This uses a regex to replace every character in the Unicode dash punctuation category (Pd) with an ASCII "-", as stated here.
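As a rough illustration of the replacement above, here is a minimal, self-contained sketch. The class and method names (`DashNormalizer`, `normalizeDashes`) are made up for this example. It only helps if the string really contains a dash character: `\p{Pd}` matches the en dash (U+2013) and em dash (U+2014), but not the replacement character U+FFFD.

```java
public class DashNormalizer {

    // Replaces every character in the Unicode "Punctuation, Dash" category (Pd)
    // with an ASCII hyphen. Pd covers the en dash (U+2013) and em dash (U+2014).
    // Hypothetical helper name, used only for this illustration.
    public static String normalizeDashes(String text) {
        return text.replaceAll("\\p{Pd}", "-");
    }

    public static void main(String[] args) {
        System.out.println(normalizeDashes("1993 \u2013 1995")); // 1993 - 1995
        System.out.println(normalizeDashes("1993 \u2014 1995")); // 1993 - 1995

        // U+FFFD is not a dash, so a string that was already damaged is left unchanged.
        String damaged = "1993 \uFFFD 1995";
        System.out.println(normalizeDashes(damaged).equals(damaged)); // true
    }
}
```

If the output still shows a question mark after this call, that would suggest the dash was already lost before the string reached this code.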

hack_on
  • 1
    As per the link it should work, but it's not working. Even in the Eclipse console the em dash is not printed and appears only as a question mark. Any idea? – prabu Mar 02 '17 at 08:15
  • 1
    The problem might not be what you expect: the driver may be corrupting the data on the way to the database or on the way back, or it is not actually the character you think it is. Try to prove which of your assumptions is false by, say, connecting to the database with a command-line tool that supports UTF-8 and displaying the value. Then determine what Unicode sequence is coming back into Java. – hack_on Mar 02 '17 at 08:19
  • Let me give more details: the data is copied and pasted from a Word document into the HTML form, and on submission it is saved to the database. If we open the submitted form again in the web application, the data appears fine. So it can be viewed in the frontend, but it cannot be read correctly from the database and written to Excel in the backend. – prabu Mar 02 '17 at 08:25
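The check suggested in the comments, finding out which Unicode sequence actually arrives in Java, could be sketched roughly as follows. The class and method names (`CodePointDump`, `describe`) are made up for this example; the idea is to print code points instead of trusting what a console or Excel renders.

```java
public class CodePointDump {

    // Returns every code point of the string in U+XXXX form, separated by spaces.
    // Hypothetical helper name, used only for this illustration.
    public static String describe(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // step over surrogate pairs correctly
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An intact en dash shows up as U+2013.
        System.out.println(describe("3 \u2013 1")); // U+0033 U+0020 U+2013 U+0020 U+0031
        // A value that was damaged earlier shows up as U+FFFD.
        System.out.println(describe("3 \uFFFD 1")); // U+0033 U+0020 U+FFFD U+0020 U+0031
    }
}
```

Calling `describe` on the value right after it is read from the database would show whether the dash is still U+2013 or U+2014 at that point, or has already become U+FFFD or a plain `?` (U+003F).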
0

Unicode for � is: \uFFFD

keyStrenghts = "1993 � 1995";
if(keyStrenghts.contains("\uFFFD")){
   keyStrenghts = keyStrenghts.replace("\uFFFD","-");
}

Now if you print keyStrenghts you will get: 1993 - 1995
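Note that U+FFFD is the Unicode replacement character, which a decoder substitutes for bytes it cannot interpret, so once it appears the original dash can no longer be recovered from the string. The sketch below shows one way this can arise. It assumes, purely for illustration, that the text was encoded as windows-1252 and then decoded as UTF-8; the actual cause in the question is not known. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

public class CharsetMismatchDemo {

    private static final Charset UTF8 = Charset.forName("UTF-8");
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Encodes with one charset and decodes with another (the assumed mistake).
    // In windows-1252 the en dash is the single byte 0x96, which is not valid
    // UTF-8 on its own, so the decoder substitutes U+FFFD.
    public static String decodeWrongly(String original) {
        byte[] bytes = original.getBytes(CP1252);
        return new String(bytes, UTF8);
    }

    // Encodes and decodes with the same explicitly named charset: lossless.
    public static String roundTrip(String original) {
        return new String(original.getBytes(UTF8), UTF8);
    }

    public static void main(String[] args) {
        String original = "1993 \u2013 1995";
        System.out.println(decodeWrongly(original).equals("1993 \uFFFD 1995")); // true
        System.out.println(roundTrip(original).equals(original));               // true
    }
}
```

Both methods name the charset on each side of the conversion. By contrast, `new String(bytes)` with no charset argument decodes using the platform default, which may or may not be UTF-8.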

George Ninan