1

I am in search of an XSLT/XQuery function based on XPath 2.0. I am sure this function must already exist somewhere, but my keyword searches don't yield an answer.

Presume a sequence of n strings, each of whose items can be converted into a sequence of strings of arbitrary length using tokenize. For example, given input (or see Michael Hor edits below for simpler version):

<alignment>
  <text>really the man and a hand</text>
  <text>the hand and a man</text>
  <text>really a hand and the man</text>
  <text>hand man a the</text>
  <text>the man really is a hand</text>
  <text>man and a hand</text>
</alignment>

...bound to a variable $array via for $i in //text return replace($i,'\s+',':')

$array=("really:the:man:and:a:hand",
        "the:hand:and:a:man",
        "really:a:hand:and:the:man",
        "hand:man:a:the",
        "the:man:really:is:a:hand",
        "man:and:a:hand")

We wish to create a customized function based upon XPath 2.0, exfn:comb-array($arg1 as xs:item+,$arg2 as xs:string) that, when given $array and its delimiter, returns the smallest combinatorial array. That is, a new sequence of n sequence-strings where null values and delimiters have been inserted such that (1) when tokenized each new sequence-string has the same length (len), (2) that length is the smallest possible, (3) the internal order of each original sequence-string is preserved, and (4) for each position in each sequence-string the value is null or an identical string. That is, for $i in $newarray return distinct-values(subsequence(tokenize($i,':'),x,1)) is either null or the same string value for every x where 1 < x < len + 1.

To use the illustration above, we want the output (or see edits below):

$arraynew=("really::::::the:man::::and:a:hand:",
"::::::the::hand:::and:a::man",
"really:::a:hand:and:the:man:::::::",
":hand:man:a:::the::::::::",
"::::::the:man::really:is::a:hand:",
":::::::man::::and:a:hand:")

This is a simplified form of a more general question about how to align multiple versions of a text that show heavy editing, but to conduct the alignment such that document order of each version is preserved, but the resulting union document (array of versions) is of the least possible length.


EDIT by michael.hor257k

I have taken the liberty of transforming your example to a more intelligible form:

INPUT:

A, B, C, D, E, F,
B, F, D, E, C,
A, E, F, D, B, C,
F, C, E, B,
B, C, A, G, E, F,
C, D, E, F

OUTPUT

A, -, -, -, -, -, B, C, -, -, -, D, E, F, -,
-, -, -, -, -, -, B, -, F, -, -, D, E, -, C,
A, -, -, E, F, D, B, C, -, -, -, -, -, -, -,
-, F, C, E, -, -, B, -, -, -, -, -, -, -, -,
-, -, -, -, -, -, B, C, -, A, G, -, E, F, -,
-, -, -, -, -, -, -, C, -, -, -, D, E, F, -

(and I still don't get it).

jdk
  • 111
  • 5
  • Welcome to stackoverflow. http://whathaveyoutried.com? Please show us what you have so far. SO is not a code writing service, and you will get a better response if you provide evidence of your own work. Please see [the Help pages](http://stackoverflow.com/help). – freefaller Feb 16 '15 at 15:31
  • 2
    XPath, XSLT and XQuery all take _XML_ as input. You forgot to add yours to the question. – Mathias Müller Feb 16 '15 at 17:00
  • "*I am sure this function must already exist somewhere*" I doubt that very much, but *you can find **all** XQuery and XPath 2.0 functions here: http://www.w3.org/TR/xpath-functions/#contents -- I couldn't understand your description of the problem. It seems to me that just appending a couple of empty tokens to each string, so that they all have a uniform "length" (i.e. number of tokens), would satisfy all your conditions. If not, what is the exact algorithm by which you mean to place those empty tokens? – michael.hor257k Feb 16 '15 at 17:23
  • Yes, I know about the elemental XPath functions. The algorithm (the correct comb. of XPath functions) is what I'm after. I am guessing it has to do with picking a starting sequence and for each subsequent sequence find the longest sequence fragment shared with the starting. Then split each sequence into three parts: sequence fragments (1) preceding, (2) part of, and (3) following the longest common overlaps. Loop on each (1) and (3) to produce a new (1), (2), and (3) until no (2) can be produced. There should be more but this is as far as I've gotten. – jdk Feb 17 '15 at 15:23
  • I would suggest you split this exercise into two: (1) find out the exact algorithm to accomplish this task; by *algorithm* I don't mean a combination of XPath (or any other language) functions, but rather a sequence of steps one can follow (even manually) to arrive at the expected solution. Once you have that, ask (2) how to implement this algorithm in XSLT. -- Personally, I would insert yet another step and ask what is the best tool to implement such algorithm; IMHO, XSLT is **not** likely to be a top contender here. – michael.hor257k Feb 17 '15 at 17:37
  • For algorithms see: https://stackoverflow.com/questions/9065536/text-comparison-algorithm. XSLT/XPath is not going to be the best language to implement them. Just use XSLT to extract the texts from your XML and then compare using existing utilities. In general, it is not an easy task. It has many solutions and you want to choose the best one, and even the specification of what is "the best one" may be difficult... – David L. Oct 07 '19 at 14:46

1 Answers1

0

I have developed an XSLT-based answer to this problem, available now in tan:collate-sequences(), in the TAN function library, in the core component.

jdk
  • 111
  • 5