
I am writing a parser that reads every Unicode character in a JSON stream and outputs XML accordingly. For the most part, JSON is easily convertible to XML. One sticking point is that JSON object keys can contain pretty much anything:

{
  "100": "valid",
  "٢٢": "valid",
  "0x8F": "valid",
  "3.14": "valid",
  "2alpha": "valid",
  "$@!": "valid",
  "Europ€": "valid",
  " ": "valid",
  "tag name": "valid"
}

The same is not true of XML element/tag names:

<root>
  <100>invalid</100>
  <٢٢>invalid</٢٢>
  <0x8F>invalid</0x8F>
  <3.14>invalid</3.14>
  <2alpha>invalid</2alpha>
  <$@!>invalid</$@!>
  <Europ€>invalid</Europ€>
  < >invalid</ >
  <->invalid</->
  <.>invalid</.>
  <tag name>invalid</tag name>
</root>

The following, however, IS valid:

<root>
  <_->valid</_->
  <_.>valid</_.>
  <éÞäğı>valid</éÞäğı>
  <Ë231>valid</Ë231>
  <გამარჯობა>valid</გამარჯობა>
  <สวัสดี>valid</สวัสดี>
  <你好>valid</你好>
</root>

and probably even this:

<root>
  <سلام>probably valid</سلام>
  <שָׁלוֹם>probably valid</שָׁלוֹם>
</root>

I say "probably" in the last example because one of the online validators I used considers elements with RTL names malformed, while all the others consider them valid. I expect this is due to a limitation of that particular validator. Personally I found experimenting to be easier than trying to understand the XML specs. What I have gleaned from my experimentation is the following:

Any letter, regardless of language, is valid anywhere inside the element name, as is the underscore (_) character. Numerals (regardless of language) and some punctuation (. and -) are valid after the first character but invalid at the start. Symbols ($@#₺€...) and most other punctuation (!?,;…) are invalid regardless of their position.

Since this is quite complex, I need two functions:

public function charValidInElementName(string $char): bool;
public function charValidInElementStart(string $char): bool;
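As a rough sketch, the two signatures could be filled in with PCRE Unicode properties, following the rules observed above rather than the XML specification itself. The spec defines name characters by explicit code-point ranges, which these Unicode categories only approximate. Allowing combining marks (\p{M}) inside a name is an assumption, made because the Thai and Hebrew examples contain them. The functions are written as plain functions here instead of class methods.

```php
<?php
declare(strict_types=1);

/**
 * True if $char (a single UTF-8 character) may start an element name
 * under the observed rules: any letter, or the underscore.
 */
function charValidInElementStart(string $char): bool
{
    // \p{L} matches any Unicode letter; the /u flag treats input as UTF-8.
    // \A and \z anchor to the whole string, so exactly one character must match.
    return preg_match('/\A[\p{L}_]\z/u', $char) === 1;
}

/**
 * True if $char (a single UTF-8 character) may appear after the first
 * character of an element name under the observed rules.
 */
function charValidInElementName(string $char): bool
{
    // Adds decimal digits (\p{Nd}), '.', '-' and, as an assumption,
    // combining marks (\p{M}) to the start-character set.
    return preg_match('/\A[\p{L}\p{M}\p{Nd}_.\-]\z/u', $char) === 1;
}
```

Both functions return false for empty strings, multi-character strings, and invalid UTF-8, because preg_match() then returns 0 or false rather than 1.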

Does anyone know whether such functions are available by default in PHP? Failing that, is there a version of ctype_alpha() that returns true for all letters rather than just the English a-zA-Z, or has anyone already written similar functions?
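For the ctype_alpha() part, one possible stand-in is a Unicode-aware regular expression. The sketch below is a minimal illustration; unicodeAlpha() is a hypothetical helper name, not a built-in PHP function.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: like ctype_alpha(), but accepts letters from any script.
 */
function unicodeAlpha(string $text): bool
{
    // \p{L}+ requires at least one letter, so the empty string yields false,
    // which mirrors ctype_alpha('').
    return preg_match('/\A\p{L}+\z/u', $text) === 1;
}

var_dump(unicodeAlpha('éÞäğı'));   // bool(true)
var_dump(unicodeAlpha('你好'));     // bool(true)
var_dump(unicodeAlpha('2alpha'));  // bool(false)
```

If the intl extension is installed, IntlChar::isalpha() may be another option for single characters; it is left out here so the sketch depends only on PCRE.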

kaan_a
  • Not sure if https://stackoverflow.com/questions/2519845/how-to-check-if-string-is-a-valid-xml-element-name deals with all the cases you are after. – Nigel Ren May 18 '20 at 12:38
  • "Personally I found experimenting to be easier than trying to understand the XML specs." Welcome to many happy weeks of experimenting before you get it right. Finding it in the specs takes about 10 minutes. – Michael Kay May 18 '20 at 14:03
  • @NigelRen I suppose it does. It's not exactly the same question: that one asks how to determine whether a whole string is a valid XML element name, whereas I asked whether a character is valid at the start of or within an element name. But there is enough there that I can figure it out. I'd want to benchmark the speed difference between creating a DOMElement and using a regex. – kaan_a May 18 '20 at 14:24

0 Answers