I have a series of variable types like:

abc1A, abc1B, abc3B, ...
xyz1A, xyz2A, xyz3C, ...
data1C, data2A, ...

They are stored in a variety of XML formats:

<area name="DataMap">
    <int name="number" nullable="true">
        <case var="abc2,abc3,abc5">11</case>
        <case var="abc4,abc6*">8</case>
        <case var="data1,xyz7,xyz8">22</case>
        <case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>
        <case var="xyz{6,4A,4B,4C}">20</case>
        <case var="other01">15</case>
    </int>
</area>

I'd like to query what a given instance, for example xyz5A, maps to. The query should return 24, but I don't know ahead of time whether the instance is referenced in the XML node explicitly, as in "xyz4A", via a wildcard like "xyz4*", or inside curly braces as above.

This queries for strings on that line and will return a hit successfully:

xpath '/area[@name="DataMap"]/int[@name="number"]/case[contains(@var,"xyz")][contains(@var,"5A")]'

But the same pattern also returns a hit for data5A, which is incorrect:

xpath '/area[@name="DataMap"]/int[@name="number"]/case[contains(@var,"data")][contains(@var,"5A")]'

Are there XPath or other query constructs that can parse the inconsistent (but I assume valid) XML above? So far I can only query against explicit string matches, not the wildcard and curly-brace formats.

Christopher Creutzig
mmond
  • XPath 1.0 or XPath 2.0? (2.0 introduced `matches` with regular expressions.) – Christopher Creutzig May 18 '12 at 06:08
  • Good point. I'm using bash/perl which I guess is still 1.0. If there is a practical means to query with an XPath 2.0, that's great. I'm not sure though I'd have access to Java libs for example, on every system I'd need to query from. – mmond May 18 '12 at 14:16

2 Answers

Being in bash/perl you are likely bound to libxml. libxml doesn't support XPath 2.0. There are many questions on SO about XPath/XSLT 2.0 with libxml/libxslt and Perl.

XPath 1.0 has a variety (a small one, I have to admit) of string functions, and you could try to stack them together. I experimented for a bit, and I neither liked the result nor succeeded in covering all possible cases. You would have "ugly" constructs like:

...
or
(contains(@var, ',xyz{') and 
 contains(substring-before(substring-after(@var, ',xyz{'), '}'), '5A') and
     (contains(substring-before(substring-after(@var, ',xyz{'), '}'), ',5A,') or
      starts-with(substring-after(@var, ',xyz{'), '5A,') or
      starts-with(substring-after(@var, ',xyz{'), '5A}') or
      substring-after(substring-before(substring-after(@var, ',xyz{'), '}'), ',5A') = ''))

or
...

And then you would realize that substring-* functions work off of the first occurrence of the matching string and you need even more layers of ands and ors to handle cases like yours:

<case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>

where there are multiple xyz{ and the one you need is not known to be the first one.

I think this is a case where you forget that you have XML, do what Perl is good at, and treat it as text. As much as I like XML-aware tools for XML processing and data extraction, you will likely be better off with regexps and string manipulation in a language that was designed for them.
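
A minimal sketch of that text-based approach, using only core Perl. The helper names (expand_var, pattern_matches, lookup) are made up for illustration. It assumes that a plain entry such as data1 must match the instance name exactly, that * stands for any run of characters, and that every case element has the simple var="..." form shown in the question. If a plain entry is meant to cover every suffix, the matching rule would need to change.

```perl
use strict;
use warnings;

# Split a var attribute into plain patterns, expanding prefix{a,b,c} groups.
sub expand_var {
    my ($var) = @_;
    my @patterns;
    # An item is a run of non-comma characters or whole {...} groups,
    # so commas inside braces do not split an item.
    for my $item ($var =~ /((?:[^,{}]|\{[^}]*\})+)/g) {
        if ($item =~ /^([^{}]*)\{([^}]*)\}([^{}]*)$/) {
            my ($pre, $alts, $post) = ($1, $2, $3);
            push @patterns, map { $pre . $_ . $post } split /,/, $alts;
        } else {
            push @patterns, $item;
        }
    }
    return @patterns;
}

# True if $name equals $pattern, where '*' matches any run of characters
# (assumption: entries without '*' must match exactly).
sub pattern_matches {
    my ($pattern, $name) = @_;
    my $re = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
    return $name =~ /\A$re\z/ ? 1 : 0;
}

# Return the text of the first <case> whose var list covers $name,
# or nothing if no case matches.
sub lookup {
    my ($xml, $name) = @_;
    while ($xml =~ /<case\s+var="([^"]*)"\s*>([^<]*)<\/case>/g) {
        my ($var, $value) = ($1, $2);
        return $value if grep { pattern_matches($_, $name) } expand_var($var);
    }
    return;
}

my $xml = <<'XML';
<area name="DataMap">
    <int name="number" nullable="true">
        <case var="abc2,abc3,abc5">11</case>
        <case var="abc4,abc6*">8</case>
        <case var="data1,xyz7,xyz8">22</case>
        <case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>
        <case var="xyz{6,4A,4B,4C}">20</case>
        <case var="other01">15</case>
    </int>
</area>
XML

for my $name (qw(xyz5A data5A abc6Z)) {
    my $value = lookup($xml, $name);
    printf "%s => %s\n", $name, defined $value ? $value : 'no match';
}
```

With the sample data from the question, xyz5A resolves to 24 through the second brace group on its line, data5A finds no match, and abc6Z resolves to 8 through the abc6* wildcard.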

Pavel Veller
  • I agree a comprehensive solution is likely not a good ROI. It would make more sense to parse it as accurately as I am able to with Xpath + tools and deal with the exceptions manually. Thanks for the input. – mmond May 18 '12 at 22:03

I guess the smartest thing would be to iterate over all variables and programmatically find the matches, not asking XPath to do it.

Barring that, I have at least a few thoughts on the braces; unfortunately, they probably don't help all that much for the * question.

It seems that there are Perl XPath implementations where you could write .../case[@var =~ /some_regex/], maybe .../case["xyz4A" =~ to_regex(@var)], and maybe even .../case[explode_braces(@var) =~ /(^|,)xyz4A(,|$)/] (with a suitably written explode_braces function, of course). See http://www.perlmonks.org/?node_id=831612, for example. I would expect the explode_braces approach to be much, much easier to get working than the first alternative, and I do use regular expressions quite a lot. Then again, you seem to use bash-style regexes, and transforming those to a Perl regex should also be relatively straightforward, so if the second idea works, you may be good to go.

If that does not work, maybe hook into your XML parser or right before it and fix this horrible XML design by expanding the braces?

$input =~ s/\bvar="([^"]*)"/'var="' . explode_braces($1) . '"'/eg;

(Or something very similar, sorry, I haven't written much perl in the last years. Also, this assumes your xml only uses one type of attribute quotes, but that should be easy to fix, and that the only place where var=" is found is in these attributes, which may be a much harder limitation.)
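
For illustration, here is one way explode_braces could be written, together with the substitution applied to two lines from the question. This is only a sketch: explode_braces is not defined anywhere in the thread, expand_group is a made-up helper, and the code assumes brace groups are never nested and that var="..." always uses double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: ("xyz", "5A,5B") becomes "xyz5A,xyz5B".
sub expand_group {
    my ($prefix, $alternatives) = @_;
    return join ',', map { $prefix . $_ } split /,/, $alternatives;
}

# One possible explode_braces: rewrite every prefix{a,b,c} group in a
# var list as prefixa,prefixb,prefixc and leave other entries alone.
sub explode_braces {
    my ($var) = @_;
    $var =~ s/([^,{}]*)\{([^}]*)\}/expand_group($1, $2)/ge;
    return $var;
}

my $input = <<'XML';
        <case var="data3A,xyz{9},xyz{5A,5B,5C}">24</case>
        <case var="xyz{6,4A,4B,4C}">20</case>
XML

# Expand the braces in every var="..." attribute before parsing.
$input =~ s/\bvar="([^"]*)"/'var="' . explode_braces($1) . '"'/eg;
print $input;

# After expansion, an exact-token test is enough for the brace cases.
my ($first_var) = $input =~ /var="([^"]*)"/;
print "xyz5A is listed\n" if $first_var =~ /(?:^|,)xyz5A(?:,|\z)/;
```

Once the braces are expanded, a plain XPath 1.0 predicate such as contains(concat(',', @var, ','), ',xyz5A,') should be enough to find exact entries without also matching data5A. Wildcard entries like abc6* would still need separate handling.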

Christopher Creutzig