
Given a byte[] peek containing the first N bytes of a text file, how can I determine whether peek is XML?

Is it enough to just check for a < at the start of the string?

Nicholas DiPiazza

2 Answers


To determine whether a given string is well-formed XML, you need a parser (for Java, read this). This is the only way to get an exact answer.

Checking the first few bytes for <?xml only lets you assume that the content is XML. You cannot be sure until you parse it to the end.
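As a minimal sketch of the parser approach, assuming the complete content (not just a peek) is available as a byte array, the JDK's built-in SAX parser can be used. The class and method names here (`XmlCheck`, `isWellFormedXml`) are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlCheck {
    // Hypothetical helper: returns true only if the whole input parses as well-formed XML.
    static boolean isWellFormedXml(byte[] data) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Ask the parser to apply its secure-processing limits to the input.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            // DefaultHandler ignores all content; its fatalError() rethrows,
            // so malformed input surfaces as an exception here.
            factory.newSAXParser().parse(new ByteArrayInputStream(data), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // Any parse, configuration, or I/O failure is treated as "not XML".
            return false;
        }
    }
}
```

Note that a truncated buffer such as `<a><b>` is rejected, because the document never closes. This check therefore only gives a meaningful answer for the full file, not for the first N bytes.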

Community
augur
  • I only have `N` bytes. and most of the xml files in my use case do not have `` they would just start like this `` – Nicholas DiPiazza Oct 09 '16 at 15:47
  • All I have so far is: `String peekStr = new String(peek); peekStr.contains(" – Nicholas DiPiazza Oct 09 '16 at 15:50
  • @NicholasDiPiazza no matter how long string is, you still need to read it all to validate it as XML. If you read only first bytes and see ``, you can only say "okay, in the beginning it looks like XML". Beyond these few bytes there may be any malformed data. This type of check isn't safe. – augur Oct 09 '16 at 15:58
  • @NicholasDiPiazza I highly recommend using a full-fledged parser for your problem. It is a healthier way to design the program, and I think the performance cost would be negligible. – augur Oct 09 '16 at 16:04
  • A parser won't answer the 'small N' question, so I'm not sure you're helping by suggesting it. – bmargulies Oct 09 '16 at 16:21
  • @bmargulies, for 'small N', the answer would be "there is no way to guarantee this is XML". If we reduce the question to "with access to only the first few bytes, check whether the string is **NOT** XML", then yes, the code above does something useful. We don't know the problem context, but these limitations are quite likely the result of poor design, and by answering this question directly we only show the author how to shoot himself in the foot more precisely. – augur Oct 09 '16 at 16:52

The XML specification says documents should begin with an XML declaration (<?xml), which makes it possible to tell that they are XML. If you have chosen not to follow that recommendation, there is no reliable way to tell. Any test that looks only at a small number of bytes, such as checking for a leading <, will be passed by some non-XML files (HTML, for example). Others won't pass. Also note that a well-formed XML file may begin with a Unicode BOM, so be sure to take that into account if you decide to go ahead and try this.
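If you do go ahead, a rough sketch of such a small-N check might look like the following. It assumes the file is UTF-8 or another ASCII-compatible encoding (UTF-16 input is not handled), and the names `XmlSniffer` and `looksLikeXml` are made up for illustration. A `true` result only means "might be XML":

```java
public class XmlSniffer {
    // Heuristic only: false means "does not start like XML",
    // true means "starts with '<'", which HTML and other formats also do.
    static boolean looksLikeXml(byte[] peek) {
        int i = 0;
        // Skip a UTF-8 byte order mark (EF BB BF) if present.
        if (peek.length >= 3
                && (peek[0] & 0xFF) == 0xEF
                && (peek[1] & 0xFF) == 0xBB
                && (peek[2] & 0xFF) == 0xBF) {
            i = 3;
        }
        // Skip XML whitespace, which may precede the root element
        // when the file has no XML declaration.
        while (i < peek.length
                && (peek[i] == ' ' || peek[i] == '\t' || peek[i] == '\r' || peek[i] == '\n')) {
            i++;
        }
        // The first significant byte of any XML document is '<'.
        return i < peek.length && peek[i] == '<';
    }
}
```

This rejects plain text and empty buffers, but it cannot distinguish `<root>` from `<html>`, which is exactly the limitation described above.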

bmargulies