How do I use xpath on Unicode's XML Unicode Character Database (UCD/UCDXML)?

Question

I'm trying to write a script to build a few tables for my Unicode library. One of the tables I need is a list of all the numeric codepoints in the Unicode standard, with their values.

To do that, I'm using xmllint in a shell script. I'm stuck on building up the list of codepoints and their values because my XPath query isn't working.

I've tried adapting a bunch of XPath query strings I've seen in other questions here on Stack Overflow.

Here are the query strings I'm currently trying: `ucd/repertoire/char/@nv[.!='NaN']` and `xmllint --xpath /*[local-name()='ucd']/*[local-name()='repertoire']/*[local-name()='char']/@nv`

Here's an example codepoint so you can see its layout.

<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
    <description>Unicode 10.0.0</description>
    <repertoire>
        <char cp="0000" age="1.1" na="" JSN="" gc="Cc" ccc="0" dt="none" dm="#" nt="None" nv="NaN" bc="BN" bpt="n" bpb="#" Bidi_M="N" bmg="" suc="#" slc="#" stc="#" uc="#" lc="#" tc="#" scf="#" cf="#" jt="U" jg="No_Joining_Group" ea="N" lb="CM" sc="Zyyy" scx="Zyyy" Dash="N" WSpace="N" Hyphen="N" QMark="N" Radical="N" Ideo="N" UIdeo="N" IDSB="N" IDST="N" hst="NA" DI="N" ODI="N" Alpha="N" OAlpha="N" Upper="N" OUpper="N" Lower="N" OLower="N" Math="N" OMath="N" Hex="N" AHex="N" NChar="N" VS="N" Bidi_C="N" Join_C="N" Gr_Base="N" Gr_Ext="N" OGr_Ext="N" Gr_Link="N" STerm="N" Ext="N" Term="N" Dia="N" Dep="N" IDS="N" OIDS="N" XIDS="N" IDC="N" OIDC="N" XIDC="N" SD="N" LOE="N" Pat_WS="N" Pat_Syn="N" GCB="CN" WB="XX" SB="XX" CE="N" Comp_Ex="N" NFC_QC="Y" NFD_QC="Y" NFKC_QC="Y" NFKD_QC="Y" XO_NFC="N" XO_NFD="N" XO_NFKC="N" XO_NFKD="N" FC_NFKC="#" CI="N" Cased="N" CWCF="N" CWCM="N" CWKCF="N" CWL="N" CWT="N" CWU="N" NFKC_CF="#" InSC="Other" InPC="NA" PCM="N" vo="R" RI="N" blk="ASCII" isc="" na1="NULL">
            <name-alias alias="NUL" type="abbreviation"/>
            <name-alias alias="NULL" type="control"/>
        </char>
    </repertoire>
</ucd>

I'm trying to check each char to see whether its nv attribute is a valid number by ignoring "NaN". If the attribute is anything but NaN, I assume it's valid, grab the cp value and its nv value, and put them into a C table. I haven't really gotten to the higher-level scripting parts yet; I'm stuck on the XPath portion.
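The filtering and table-building step described above can be sketched in Python instead of the shell, as one possible approach rather than a definitive one. The sample declares a default namespace, so the sketch matches the namespace-qualified element name. It assumes that nv holds either "NaN" or something `fractions.Fraction` can parse (for example "7" or "1/4"), and the row layout and the `NumericEntry` struct name are hypothetical stand-ins for whatever the C library actually uses.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

# Namespace taken from the xmlns declaration on the <ucd> element.
UCD_NS = "http://www.unicode.org/ns/2003/ucd/1.0"

# Trimmed-down sample in the shape shown above (most attributes omitted).
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
    <repertoire>
        <char cp="0000" nv="NaN"/>
        <char cp="0031" nv="1"/>
        <char cp="00BC" nv="1/4"/>
    </repertoire>
</ucd>"""


def numeric_table_rows(xml_text):
    """Yield one C initializer row per char whose nv is not NaN."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced tag as "{namespace-uri}localname".
    for char in root.iter(f"{{{UCD_NS}}}char"):
        cp, nv = char.get("cp"), char.get("nv")
        if cp is None or nv in (None, "NaN"):
            continue  # skip entries without a single cp, and non-numeric chars
        value = Fraction(nv)  # assumption: nv is an integer or "a/b"
        yield f"    {{0x{cp}, {value.numerator}, {value.denominator}}},"


# Hypothetical C table: {codepoint, numerator, denominator}.
print("static const NumericEntry numeric_table[] = {")
for row in numeric_table_rows(SAMPLE):
    print(row)
print("};")
```

For the full database, `ET.parse` on the downloaded file (or `ET.iterparse`, given the file's size) would replace `ET.fromstring`; the script can still be invoked from a shell script.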

So, where am I going wrong? I've tried all kinds of variations on the path (//ucd/repertoire/char, ucd/repertoire/char, just //char@nv, and a bunch of others I don't even remember).

@kjhughes it's not a duplicate, I'm asking how to parse Unicode's XML Unicode Character Database, you're talking about using Unicode characters in XPath queries. — MarcusJ, Jan 26 '18 at 14:25
Are you not using XPath? You say you are: *I've tried customizing a bunch of xpath query strings*. — kjhughes, Jan 26 '18 at 14:28
And? How is that relevant? This is about parsing the UCD, not about using Unicode in XPath queries. — MarcusJ, Jan 26 '18 at 14:29
Is the UCD not XML? You show it as such. So, unless I'm misunderstanding, you are using XPath to extract information from namespaced XML, but you're failing to account for the namespace in your XPath expressions. — kjhughes, Jan 26 '18 at 14:30
OK, that might get me somewhere. Can you use a namespace from a shell script? The linked post describes using XPath in Python, VBA, etc., but not old-fashioned shell scripts, and I assume the namespace you're referring to is `ucd`, is that right? — MarcusJ, Jan 26 '18 at 14:33
xmllint has no way to declare an XML namespace. Use xmlstarlet, python, or perl from a shell script instead. — kjhughes, Jan 26 '18 at 14:35
...and, no, `ucd` is an element, which is in the `http://www.unicode.org/ns/2003/ucd/1.0` namespace by virtue of the default namespace declaration (`xmlns="..."`) on it. — kjhughes, Jan 26 '18 at 17:49
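
The namespace point in the comments can be illustrated with a small Python sketch using the standard library's `xml.etree.ElementTree`. It is only an illustration on a trimmed-down sample, not a statement about how xmllint behaves; the prefix `u` is an arbitrary name chosen here and does not appear in the file.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the shape shown in the question (most attributes omitted).
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
    <repertoire>
        <char cp="0000" nv="NaN"/>
        <char cp="0031" nv="1"/>
        <char cp="00BC" nv="1/4"/>
    </repertoire>
</ucd>"""

# Map an arbitrary prefix to the namespace URI from the xmlns declaration.
NS = {"u": "http://www.unicode.org/ns/2003/ucd/1.0"}

root = ET.fromstring(SAMPLE)  # root is the <ucd> element

# Unqualified names look for elements in no namespace, so nothing matches.
unqualified = root.findall("repertoire/char")

# Prefixed names match the elements covered by the default namespace.
qualified = root.findall("u:repertoire/u:char", NS)

# Attributes are not affected by the default namespace, so plain names work.
numeric = [(c.get("cp"), c.get("nv")) for c in qualified if c.get("nv") != "NaN"]

print(len(unqualified), len(qualified), numeric)
```

The `NaN` filter is done in Python rather than in the path expression because ElementTree supports only a subset of XPath.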

