I am reading an XML file using `Scanner scanner = new Scanner(inputStream, "UTF-8");` and then going line by line using `scanner.nextLine()`.

I have some UTF-8 encoded text in the XML file that I read, and it works perfectly when I run my app locally through the Jetty server in my Eclipse Helios IDE.

However, when the app is built and deployed on the Tomcat server that we use as our dev server, the UTF-8 characters appear as '?' everywhere. When I put some logging in place, I found that the characters were already being read that way, even though I specify UTF-8 when I initialize the scanner.

I am unable to understand why it would work locally for me but not when I deploy it on Tomcat.

I am sure many people have come across this before.

Karthik
  • Why are you reading XML line by line instead of getting an XML parser to do it? – Jon Skeet Jul 17 '12 at 20:25
  • I could try that option, but I wanted to find out about this since I have already gone down this path. Also, I am interested in finding out why it would work locally for me but not on a remote server. – Karthik Jul 17 '12 at 20:28
  • My guess is that it's not *really* a UTF-8 file on your remote server. It may have been corrupted along the way. – Jon Skeet Jul 17 '12 at 20:30
  • It has the standard XML declaration (specifying UTF-8) listed at the beginning. – Karthik Jul 17 '12 at 20:32
  • That doesn't *really* mean it's UTF-8. Imagine it was corrupted in transfer somehow, e.g. someone saving it in the wrong encoding as a text file, or fetching it via FTP in text mode. The file could say UTF-8 but not actually *be* in UTF-8. If this is the same file you were testing with locally, just compare the MD5 sum (and length) of both files. – Jon Skeet Jul 17 '12 at 20:33
  • Hmm... can you elaborate on that? I don't know what an MD5 sum is. – Karthik Jul 17 '12 at 20:33
  • Well, find *any* MD5 tool appropriate for the operating systems on your local and remote machines. Do a quick search for one. Run it on the file, and compare the results. It's a simple way of seeing whether they're (almost certainly) the same or not, on a binary basis. – Jon Skeet Jul 17 '12 at 20:34
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/14028/discussion-between-karthik-and-jon-skeet) – Karthik Jul 17 '12 at 21:17
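
For readers without an MD5 tool at hand, the check suggested in the comments can also be done in a few lines of Java. This is only a minimal sketch: the class name `Md5Check` and the command-line usage are made up for illustration, and it assumes Java 7 or later for `java.nio.file.Files`. It hashes the raw bytes, so no character decoding is involved.

```java
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the original question.
public class Md5Check {
    // Hex MD5 digest of raw bytes, zero-padded to 32 hex digits.
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return String.format("%032x", new BigInteger(1, md.digest(data)));
    }

    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        // Assumed usage: java Md5Check <file> [<file> ...]
        for (String arg : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(arg));
            System.out.println(arg + "  length=" + bytes.length + "  md5=" + md5Hex(bytes));
        }
    }
}
```

Running it against the XML file on both machines and comparing the printed length and digest shows whether the two copies are byte-for-byte identical (almost certainly, in the case of matching digests).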

1 Answer

Are you sure you have Tomcat configured to display UTF-8?

Have you configured the page displaying it? There is a good how-to in the question "How to get UTF-8 working in Java webapps?"

Also, have you set the default file encoding to UTF-8 in catalina.sh?

-Dfile.encoding=UTF-8

http://www.redleopard.com/2008/12/utf-8-on-tomcat/

I wouldn't expect it to log UTF-8 properly without that configuration.
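
One way to test this theory is a small check like the sketch below. It rests on an assumption the question does not confirm: that the dev server's default charset is something other than UTF-8. The class and method names (`EncodingCheck`, `readFirstLine`, `roundTrip`) are made up for illustration. The idea is that the `Scanner` can decode the text correctly while the '?' is introduced later, when the string is written out (to a log, for example) in a charset that cannot represent the character.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical diagnostic, not part of the original answer.
public class EncodingCheck {
    // Decode the first line of a stream as UTF-8, as the question does.
    static String readFirstLine(InputStream in) {
        Scanner scanner = new Scanner(in, "UTF-8");
        try {
            return scanner.nextLine();
        } finally {
            scanner.close();
        }
    }

    // Encode and decode text with the given charset, mimicking what a
    // writer using that charset would do to the string.
    static String roundTrip(String text, Charset outputCharset) {
        return new String(text.getBytes(outputCharset), outputCharset);
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        String line = readFirstLine(new ByteArrayInputStream(utf8));

        // The JVM default is what file.encoding influences.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("decoded correctly: " + line.equals("caf\u00e9"));
        // US-ASCII cannot represent U+00E9, so it is replaced with '?'.
        System.out.println("through US-ASCII: " + roundTrip(line, StandardCharsets.US_ASCII));
        System.out.println("through UTF-8 intact: " + roundTrip(line, StandardCharsets.UTF_8).equals(line));
    }
}
```

If the first line printed on the dev server is not UTF-8, the log output rather than the `Scanner` is a likely place where the characters are being lost.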

user1258245
  • The page that displays it is an HTML file, and it has been configured to display UTF-8. I still have to look at Tomcat and see whether it has been configured to handle it at all. Locally, my Tomcat has. – Karthik Jul 18 '12 at 14:55