My application is malfunctioning in many areas because of special characters in its strings.

Eg 1 : A ? character is displayed instead of ’.

Text :
The Hilton Paris La Defense hotel is located at the foot of the Grande Arche at the very heart of Europe’s largest business district and puts you in easy reach of some of Paris’ most famous attractions. Only a few minutes from the...

Eg 2 : A parser exception occurs when parsing XML that contains special characters (like ’, & etc.) using AXIOM.

XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(new StringBufferInputStream(responseXML));
OMElement documentElement = new StAXOMBuilder(parser).getDocumentElement();

I found many posts about removing such characters when they occur, e.g. "How to remove bad characters that are not suitable for utf8 encoding in MySQL?" and "remove non-UTF-8 characters from xml with declared encoding=utf-8 - Java".

I'm using the following code to remove the non-UTF-compliant characters.

if (null == inString ) return null;

byte[] byteArr = inString.getBytes();

for ( int i=0; i < byteArr.length; i++ ) {
   byte ch= byteArr[i]; 
   if ( !(ch < 0x00FD && ch > 0x001F) || ch =='&' || ch=='#') {
      byteArr[i]=' ';
   }
}

return new String( byteArr );

But this leads to another problem: it removes informative characters like ’.

What I want is to replace them in a meaningful way rather than simply removing them. Eg : ’ can be replaced by ', & can be replaced by 'and', etc. Is there a standard way to do this rather than manually replacing them one by one?

ironwood
  • What exactly do you mean by "non UTF compliant characters"? I'm not aware of such a term. Additionally, you're using the platform-default encoding when constructing a string, which seems like a very bad idea to me. What are you *really* trying to achieve, and what encoding is your input really in? – Jon Skeet Jun 24 '13 at 06:23
  • @oneliner, please add some example text (input + expected output + current output). – pepuch Jun 24 '13 at 06:32
  • Well, this is used in many ways in my application. Eg : I have a program that gets some information by reading an XML file. But that XML couldn't be parsed by an XML parser (AXIOM) because it contains special characters like these. In another scenario, there's a text value in the database containing similar special characters. They cause display issues (shown as '?') in the front-end web page. – ironwood Jun 24 '13 at 06:36
  • Maybe you should convert the text to a different encoding instead of removing characters? – pepuch Jun 24 '13 at 06:37
  • You have a classic XY problem. You have problem X, and you think the best way to solve it is Y, so you ask about Y instead of asking about X. (See http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) Instead show us the code you've written to read your XML-file and we can tell you what's wrong with it. – Christoffer Hammarström Jun 24 '13 at 06:53
  • @Christoffer : Issues with non-UTF-compliant characters have been a general problem for many developers, not only for me. (Check : http://stackoverflow.com/questions/13657019/how-to-remove-bad-characters-that-are-not-suitable-for-utf8-encoding-in-mysql and http://stackoverflow.com/questions/2869072/remove-non-utf-8-characters-from-xml-with-declared-encoding-utf-8-java) That's why I asked. Anyway, if you want the code, I have added it to the question. – ironwood Jun 24 '13 at 07:51
  • @oneliner: Thousands of developers struggle with character encoding issues every day. There are many character encoding experts on StackOverflow that can help you, me for example. Show us the code you use to read the XML file, and we'll tell you why you're not reading the characters correctly. I don't see it in your question. – Christoffer Hammarström Jun 24 '13 at 08:07
  • @Christoffer : I just added it in my edit now. Thank you for your effort on this, Christoffer. Apologies if my question was not clear to you. I tried to make it clearer. – ironwood Jun 24 '13 at 08:10
  • Where are you reading the XML from? A file? In that case use FileInputStream. – Christoffer Hammarström Jun 24 '13 at 08:11

1 Answer

The javadoc for StringBufferInputStream says

Deprecated. This class does not properly convert characters into bytes. As of JDK 1.1, the preferred way to create a stream from a string is via the StringReader class.

Don't use it.
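
A minimal sketch of what the Javadoc recommends, using only the JDK's StAX API rather than AXIOM. The class name, the helper method, and the sample XML are made up for illustration; the point is only that a StringReader hands the parser characters directly, so no character-to-byte conversion can go wrong.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StringReaderXmlExample {

    // Collects all character data from an XML document that is already a String.
    static String textOf(String xml) {
        try {
            // StringReader replaces the deprecated StringBufferInputStream.
            XMLStreamReader parser =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            StringBuilder text = new StringBuilder();
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(parser.getText());
                }
            }
            parser.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // \u2019 is the right single quotation mark from the question's sample text.
        String responseXML = "<hotel>Europe\u2019s largest business district</hotel>";
        System.out.println(textOf(responseXML));
    }
}
```

With AXIOM, the same XMLStreamReader could presumably be passed to the StAXOMBuilder constructor used in the question.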

The file is read as bytes, no matter where it comes from. Never convert your data to a String if you need it as bytes in the first place.

If you're reading from a file, use a FileInputStream. (Never use FileReader, since it doesn't allow you to specify the encoding.)
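
As a rough illustration of handing the parser raw bytes, assuming a UTF-8 file: the class, the helper names, and the sample XML below are hypothetical, and AXIOM is left out so the sketch needs only the JDK.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ByteStreamXmlExample {

    // Parses the file as raw bytes and returns its character data.
    static String textOf(File xmlFile) {
        try (InputStream in = new FileInputStream(xmlFile)) {
            // No String or Reader in between: the parser sees the bytes as stored.
            XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(in);
            StringBuilder text = new StringBuilder();
            while (parser.hasNext()) {
                if (parser.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(parser.getText());
                }
            }
            parser.close();
            return text.toString();
        } catch (IOException | XMLStreamException e) {
            throw new IllegalStateException("Could not parse " + xmlFile, e);
        }
    }

    // Hypothetical helper: writes the XML to a temp file as UTF-8 bytes, then parses it.
    static String writeAndParse(String xml) {
        try {
            File file = File.createTempFile("hotel", ".xml");
            try {
                Files.write(file.toPath(), xml.getBytes(StandardCharsets.UTF_8));
                return textOf(file);
            } finally {
                file.delete();
            }
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<hotel>Europe\u2019s largest business district &amp; more</hotel>";
        System.out.println(writeAndParse(xml));
    }
}
```

Because the parser receives bytes, it can determine the encoding from the XML declaration itself. The ’ survives untouched, and the escaped &amp; comes back as a plain & in the returned string.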

Christoffer Hammarström
  • I checked it, Christoffer. The issue you pointed out was there. Thank you. But it is not related to this special character issue. Although I corrected the StringReader issue, it throws the error below when the XML has non-UTF-compliant characters like &. org.apache.axiom.om.OMException: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxUnexpectedCharException: Unexpected character ' ' (code 32) (missing name?) at [row,col {unknown-source}]: [1,6177] – ironwood Jun 24 '13 at 08:58
  • @oneliner: First, you still haven't explained what you mean by "non-UTF compliant". There is no such thing, and '&' can certainly be encoded in any UTF-encoding. – Christoffer Hammarström Jun 24 '13 at 09:31
  • @oneliner: Second, you still haven't shown the code you use to actually read the XML file. Please do so. – Christoffer Hammarström Jun 24 '13 at 09:32
  • @oneliner: Third, also provide a minimal XML file that shows the problem, so we can see that your file isn't actually corrupted. – Christoffer Hammarström Jun 24 '13 at 09:33
  • @oneliner: Fourth: I hope you're still not using the broken code you provided above that puts in random spaces. If you are, then that's your problem right there. **Do not do any replacing on the XML data, or remove any characters from it.** – Christoffer Hammarström Jun 24 '13 at 09:37
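
To make the "random spaces" concrete, here is a small sketch that reruns the question's filter loop with the encoding pinned to UTF-8. That encoding is an assumption: the original code uses the platform default, so results on a real machine may differ. The class and method names are invented for the demonstration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteLoopDemo {

    // Same condition as the question's loop; only the charset is made explicit.
    static String filter(String in) {
        byte[] bytes = in.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            byte ch = bytes[i];
            // A Java byte is signed (-128..127), so every non-ASCII byte is
            // negative, fails "ch > 0x001F", and is overwritten with a space.
            if (!(ch < 0x00FD && ch > 0x001F) || ch == '&' || ch == '#') {
                bytes[i] = ' ';
            }
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+2019 is three bytes in UTF-8 (0xE2 0x80 0x99), all negative as Java bytes.
        System.out.println(Arrays.toString("\u2019".getBytes(StandardCharsets.UTF_8)));
        // prints [-30, -128, -103]
        System.out.println("[" + filter("Europe\u2019s") + "]");
        // prints [Europe   s]
    }
}
```

The loop works on bytes rather than characters, so under this assumption a single ’ turns into three spaces, and every & or # is blanked as well, including those that are part of valid XML escapes such as &amp;.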