
I have a component that should be able to parse and process any XML file given by a user. The XML file can contain timestamp values like "12 March 2012 05:00 pm". So the user has to give the Timestamp pattern that is acceptable to SimpleDateFormat. We use the pattern and SimpleDateFormat to parse the timestamp values like this:

 SimpleDateFormat sdt = new SimpleDateFormat(inputTimestampPattern);
 Date date = sdt.parse(inputTimestampString);

But we get a ParseException like the one below for one specific file:

java.text.ParseException: Unparseable date: " 04-6\u57d6 -12 18.54:57.169000 \u548c\u601c"

We got this exception when we ran the component in a Japanese locale with an input file containing timestamps written in a Chinese locale. The JVM's locale is Japanese, so SimpleDateFormat tries to parse the timestamp string assuming the Japanese locale and fails. The XML file declares its encoding like this:

  <?xml version="1.0" encoding="gbk"?>

If we could somehow figure out the Locale from the encoding value, we could create a Locale-sensitive SimpleDateFormat object, which would fix this issue. So my question is: can we get Locale information from the encoding? I'm not asking for the exact Locale. Even if there is only a way to get a small set of possible Locales for a given encoding, I can try each of them until one parses without throwing the exception. Is there any API in Java that helps here?

Or is there any better way to address this issue?
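
The "try a small set of Locales until one works" idea from the question could be sketched roughly as follows. This is only an illustration: the class name `LocaleFallbackParser`, the method `parseWithCandidates`, and the candidate list in `main` are made up for the example, and it assumes the caller already has some list of plausible Locales.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of any library.
public class LocaleFallbackParser {

    /**
     * Tries each candidate Locale in order and returns the first successful parse.
     * If no candidate works, the last ParseException is rethrown.
     */
    public static Date parseWithCandidates(String pattern, String text, List<Locale> candidates)
            throws ParseException {
        ParseException last = null;
        for (Locale locale : candidates) {
            try {
                // The two-argument constructor makes month names, AM/PM markers
                // and so on locale-specific instead of using the JVM default.
                SimpleDateFormat sdf = new SimpleDateFormat(pattern, locale);
                // Strict parsing, so out-of-range field values are rejected
                // rather than silently rolled over.
                sdf.setLenient(false);
                return sdf.parse(text);
            } catch (ParseException e) {
                last = e; // remember the failure and try the next Locale
            }
        }
        if (last == null) {
            throw new ParseException("No candidate locales given for: " + text, 0);
        }
        throw last;
    }

    public static void main(String[] args) throws ParseException {
        // Assumed candidate list, purely for demonstration.
        List<Locale> candidates =
                Arrays.asList(Locale.JAPAN, Locale.SIMPLIFIED_CHINESE, Locale.ENGLISH);
        Date d = parseWithCandidates("dd MMMM yyyy hh:mm a", "12 March 2012 05:00 pm", candidates);
        System.out.println(d);
    }
}
```

The order of the candidates matters: a purely numeric pattern parses under any Locale, so the first candidate would always win in that case.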

Anders R. Bystrup
madhusudhan
  • This looks like quite a tricky thing to do http://stackoverflow.com/questions/3389348/parse-any-date-in-java – Ross Drew Nov 25 '13 at 12:37
  • "So the user has to give the Timestamp pattern that is acceptable to SimpleDateFormat": why bother with that complexity? Why not just require the standard ISO format? XML is not very good as a user interface, so why bother putting lipstick on a pig? – Raedwald Nov 25 '13 at 13:00
  • "figure out the Locale from the encoding": this is impossible. What is the locale for UTF-8? The locale for US-ASCII is not US; it might be UK. – Raedwald Nov 25 '13 at 13:02
  • "why bother with that complexity?" We need that complexity because our component is an XML log parser. The XML files could be logs generated by random sources (that we may not even be aware of), and we should be able to process them. – madhusudhan Nov 26 '13 at 06:17
  • Take a look at this post: http://stackoverflow.com/questions/9429037/getting-encoding-type-of-a-xml-in-java. It suggests using the 'getEncoding()' method. – monojohnny Feb 17 '14 at 21:38

1 Answer


If the encoding is set in the first line of the XML, you can read the file first, obtaining only the first line, so you will catch the encoding="gbk" or whatever it is. And then set the encoding in the program with a Switch-case or however you want.
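
A rough sketch of the "read only the first line" step, assuming the XML declaration sits at the very start of the file and the encoding is ASCII-compatible (which holds for GBK, Shift_JIS and UTF-8, but not for UTF-16). The class name `XmlEncodingSniffer` and method `declaredEncoding` are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
public class XmlEncodingSniffer {

    // Matches encoding="..." or encoding='...' inside the <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding named in the XML declaration, or null if none is found
     * within the first 256 bytes. The stream is consumed, so reopen it before parsing.
     */
    public static String declaredEncoding(InputStream in) throws IOException {
        byte[] head = new byte[256];
        int n = 0;
        while (n < head.length) {
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream reached before the buffer was full
            }
            n += r;
        }
        // ISO-8859-1 maps every byte to one char, so decoding never fails and the
        // ASCII-only declaration survives unchanged.
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"gbk\"?><log/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(new ByteArrayInputStream(xml)));
    }
}
```

If a full XML parser is already in use, the StAX method `XMLStreamReader.getCharacterEncodingScheme()` is probably a more robust way to get the declared encoding than a regular expression.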

Elorry
  • "set the encoding in the program with a Switch-case..." Did you mean "set the Locale in the program with a Switch-case"? If yes, I would like to know the mapping from encoding to Locale so that I can hard-code it. Can you tell me where to find this mapping information? It also may not be a comprehensive list. For example, if the encoding is UTF-8, there may be no corresponding Locale, and we are fine with using the JVM's locale in that case. – madhusudhan Nov 26 '13 at 06:03
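
A hard-coded mapping of the kind asked about in the comment might look roughly like the sketch below. Note that the table is a guess, not an official or complete mapping: as pointed out in the comments on the question, an encoding does not determine a Locale, so these entries are only hints. The class name `EncodingLocaleHints` and method `candidatesFor` are invented for the example. Unknown encodings such as UTF-8 fall back to the JVM's default Locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; the table below is a heuristic guess, not a standard.
public class EncodingLocaleHints {

    private static final Map<String, List<Locale>> HINTS = new HashMap<>();
    static {
        HINTS.put("GBK", Arrays.asList(Locale.SIMPLIFIED_CHINESE));
        HINTS.put("GB2312", Arrays.asList(Locale.SIMPLIFIED_CHINESE));
        HINTS.put("GB18030", Arrays.asList(Locale.SIMPLIFIED_CHINESE));
        HINTS.put("BIG5", Arrays.asList(Locale.TRADITIONAL_CHINESE));
        HINTS.put("SHIFT_JIS", Arrays.asList(Locale.JAPAN));
        HINTS.put("EUC-JP", Arrays.asList(Locale.JAPAN));
        HINTS.put("EUC-KR", Arrays.asList(Locale.KOREA));
    }

    /**
     * Returns the Locales worth trying for the given encoding name, hinted Locales
     * first, always ending with the JVM default as a fallback.
     */
    public static List<Locale> candidatesFor(String encoding) {
        List<Locale> result = new ArrayList<>();
        if (encoding != null) {
            // Encoding names are case-insensitive, so normalise before the lookup.
            List<Locale> hinted = HINTS.get(encoding.trim().toUpperCase(Locale.ROOT));
            if (hinted != null) {
                result.addAll(hinted);
            }
        }
        Locale fallback = Locale.getDefault();
        if (!result.contains(fallback)) {
            result.add(fallback);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidatesFor("gbk"));
        System.out.println(candidatesFor("UTF-8"));
    }
}
```

Each returned Locale can then be passed to `new SimpleDateFormat(pattern, locale)` in turn.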