I am writing a parser that reads every Unicode character in a JSON stream and outputs XML accordingly. For the most part, JSON converts easily to XML, but object keys are a problem. JSON object keys can contain pretty much anything:
{
"100": "valid",
"٢٢": "valid",
"0x8F": "valid",
"3.14": "valid",
"2alpha": "valid",
"$@!": "valid",
"Europ€": "valid",
" ": "valid",
"tag name": "valid"
}
This is not the case for XML element names:
<root>
<100>invalid</100>
<٢٢>invalid</٢٢>
<0x8F>invalid</0x8F>
<3.14>invalid</3.14>
<2alpha>invalid</2alpha>
<$@!>invalid</$@!>
<Europ€>invalid</Europ€>
< >invalid</ >
<->invalid</->
<.>invalid</.>
<tag name>invalid</tag name>
</root>
The following, however, IS valid:
<root>
<_->valid</_->
<_.>valid</_.>
<éÞäğı>valid</éÞäğı>
<Ë231>valid</Ë231>
<გამარჯობა>valid</გამარჯობა>
<สวัสดี>valid</สวัสดี>
<你好>valid</你好>
</root>
and probably even this:
<root>
<سلام>probably valid</سلام>
<שָׁלוֹם>probably valid</שָׁלוֹם>
</root>
I say "probably" in the last example because one of the online validators I used considers the elements with RTL names malformed, while all the others consider them valid. I expect this is due to a limitation of that particular validator. Personally, I found experimenting easier than trying to understand the XML specs. What I have gleaned from my experimentation is the following:
Any letter, regardless of language, is valid anywhere inside the element name, as is the underscore (_) character. Numerals (regardless of language) and some punctuation (. and -) are valid after the first character but invalid at the start. Symbols ($@#₺€...) and most other punctuation (!?,;…) are invalid regardless of their position.
Since this is quite complex, I need two functions:
public function charValidInElementName(string $char): bool;
public function charValidInElementStart(string $char): bool;
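To make the intent concrete, here is a minimal sketch of the two functions, assuming PCRE's Unicode property classes (`\p{L}` for letters, `\p{M}` for combining marks, `\p{Nd}` for decimal digits) are an acceptable stand-in for the rules I observed. It is written as plain functions rather than methods, and it only approximates my observations, not the exact name rules in the XML specification. Including `\p{M}` is my own assumption: the Thai and Hebrew names above contain combining marks after the first letter, so I allow them inside a name but not at the start.

```php
<?php
// Sketch only: approximates the observed rules, not the XML spec's exact
// name productions. Input is assumed to be one UTF-8 encoded character.

function charValidInElementStart(string $char): bool
{
    // Exactly one character: a letter from any script, or an underscore.
    // \z anchors at the very end so a trailing newline cannot slip through.
    return preg_match('/^[\p{L}_]\z/u', $char) === 1;
}

function charValidInElementName(string $char): bool
{
    // Exactly one character: a letter, combining mark, decimal digit from
    // any script, underscore, dot, or hyphen.
    return preg_match('/^[\p{L}\p{M}\p{Nd}_.\-]\z/u', $char) === 1;
}
```

With this sketch, `charValidInElementStart('٢')` is false while `charValidInElementName('٢')` is true, and `'€'` and `' '` are rejected by both. The strict `=== 1` comparison also means invalid UTF-8 input (where `preg_match` returns `false`) is treated as invalid.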
Does anyone know whether PHP provides such functions by default? Failing that, is there a version of ctype_alpha() that returns true for all letters rather than just the English a-zA-Z, or has anyone already written similar functions?