I am in search of an XSLT/XQuery function based on XPath 2.0. I am sure this function must already exist somewhere, but my keyword searches don't yield an answer.
Assume a sequence of n strings, each of which can be converted with tokenize() into a sequence of strings of arbitrary length. For example, given this input (or see michael.hor257k's edit below for a simpler version):
<alignment>
<text>really the man and a hand</text>
<text>the hand and a man</text>
<text>really a hand and the man</text>
<text>hand man a the</text>
<text>the man really is a hand</text>
<text>man and a hand</text>
</alignment>
...bound to a variable $array via for $i in //text return replace($i,'\s+',':')
$array=("really:the:man:and:a:hand",
"the:hand:and:a:man",
"really:a:hand:and:the:man",
"hand:man:a:the",
"the:man:really:is:a:hand",
"man:and:a:hand")
We wish to create a custom function based on XPath 2.0,
exfn:comb-array($arg1 as xs:string+, $arg2 as xs:string)
that, when given $array and its delimiter, returns the smallest combinatorial array. That is, a new sequence of n strings into which null (empty) values and delimiters have been inserted such that (1) when tokenized, every new string yields the same number of tokens (len), (2) that length is the smallest possible, (3) the internal order of the tokens of each original string is preserved, and (4) at each position, every string holds either a null value or the same string. That is, for every position $x from 1 to len,
distinct-values(for $i in $newarray return tokenize($i,':')[$x][. ne '']) returns at most one value.
To use the illustration above, we want the output (or see edits below):
$arraynew=("really::::::the:man::::and:a:hand:",
"::::::the::hand:::and:a::man",
"really:::a:hand:and:the:man:::::::",
":hand:man:a:::the::::::::",
"::::::the:man::really:is::a:hand:",
":::::::man::::and:a:hand:")
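To make the conditions concrete, here is a minimal XQuery 1.0 sketch that only checks a candidate result against conditions (1), (3) and (4); it does not construct the result and does not test condition (2), minimality. The function name exfn:is-valid-alignment and the namespace URI are hypothetical, an empty token is taken to mean null, and the delimiter is assumed to contain no regex metacharacters, because tokenize() treats it as a pattern.

```xquery
xquery version "1.0";
(: Hypothetical namespace, used only for this illustration. :)
declare namespace exfn = "http://example.org/exfn";

(: Returns true() if $new is a valid alignment of $old:
   (1) every string in $new tokenizes to the same number of tokens,
   (3) dropping the empty tokens of $new[i] gives back the tokens of $old[i],
   (4) each position holds at most one distinct non-empty value.
   Condition (2), minimal length, is NOT checked here. :)
declare function exfn:is-valid-alignment(
  $old   as xs:string+,
  $new   as xs:string+,
  $delim as xs:string
) as xs:boolean {
  let $lens := distinct-values(
                 for $s in $new return count(tokenize($s, $delim)))
  return
    count($old) eq count($new)
    and count($lens) eq 1
    and (every $i in 1 to count($old) satisfies
           deep-equal(tokenize($new[$i], $delim)[. ne ''],
                      tokenize($old[$i], $delim)))
    and (every $x in 1 to $lens[1] satisfies
           count(distinct-values(
             for $s in $new
             return tokenize($s, $delim)[$x][. ne ''])) le 1)
};

let $array := ("really:the:man:and:a:hand",
               "the:hand:and:a:man",
               "really:a:hand:and:the:man",
               "hand:man:a:the",
               "the:man:really:is:a:hand",
               "man:and:a:hand")
let $arraynew := ("really::::::the:man::::and:a:hand:",
                  "::::::the::hand:::and:a::man",
                  "really:::a:hand:and:the:man:::::::",
                  ":hand:man:a:::the::::::::",
                  "::::::the:man::really:is::a:hand:",
                  ":::::::man::::and:a:hand:")
return exfn:is-valid-alignment($array, $arraynew, ":")
```

For the $array and $arraynew shown above, each string in $arraynew tokenizes to 15 items and the check should come out true. Whether 15 is actually the smallest possible length is exactly the part this sketch leaves open.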
This is a simplified form of a more general question: how to align multiple, heavily edited versions of a text so that the document order of each version is preserved and the resulting union document (the array of versions) is of the least possible length.
EDIT by michael.hor257k
I have taken the liberty of transforming your example to a more intelligible form:
INPUT:
A, B, C, D, E, F,
B, F, D, E, C,
A, E, F, D, B, C,
F, C, E, B,
B, C, A, G, E, F,
C, D, E, F
OUTPUT:
A, -, -, -, -, -, B, C, -, -, -, D, E, F, -,
-, -, -, -, -, -, B, -, F, -, -, D, E, -, C,
A, -, -, E, F, D, B, C, -, -, -, -, -, -, -,
-, F, C, E, -, -, B, -, -, -, -, -, -, -, -,
-, -, -, -, -, -, B, C, -, A, G, -, E, F, -,
-, -, -, -, -, -, -, C, -, -, -, D, E, F, -
(and I still don't get it).