MarkLogic sql S1 collation showing wrong values

Question

Lets have a rather simple TDE schema:

<tde:template xmlns:tde='http://marklogic.com/xdmp/tde'>
  <tde:description>testing a TDE view</tde:description>
  <tde:context>/test</tde:context>
  <tde:enabled>true</tde:enabled>
  <tde:rows>
    <tde:row>
      <tde:schema-name>sitefusion</tde:schema-name>
      <tde:view-name>test</tde:view-name>
      <tde:columns>
        <tde:column>
          <tde:name>name</tde:name>
          <tde:scalar-type>string</tde:scalar-type>
          <tde:val>name</tde:val>
          <tde:nullable>false</tde:nullable>
          <tde:collation>http://marklogic.com/collation/en/S1</tde:collation>
        </tde:column>
      </tde:columns>
    </tde:row>
  </tde:rows>
</tde:template>

This schema has a collation http://marklogic.com/collation/en/S1 meaning it is case and diacritic insensitive. With that in mind, i inserted a test document with a first char uppercase name Test:

let $doc := 
<test>
  <name>Test</name>
</test>
return xdmp:document-insert("/test/1.xml", $doc);

let $data := xdmp:sql("select name from test")

return json:array-pop($data)[2];

The output of that xquery is the expected value of name Test.

Now lets update the name and put a lowercase test in it:

let $doc := 
<test>
  <name>test</name>
</test>
return xdmp:document-insert("/test/1.xml", $doc);

let $data := xdmp:sql("select name from test")

return (
   json:array-pop($data)[2], 
   doc("/test/1.xml")
)

This, for some reason does now also output Test with a uppercase T. Why is this the case? The output of doc("/test/1.xml") does show the correct document:

<test>
      <name>test</name>
</test>

Is this the correct behaviour? Does collation not only affect comparing (like in a where clause), but also ingestion? Is there a way to have case-sensitive data and case-insensitive comparing ?

Edit: Using MarkLogic 9.0-7.2.

I suspect that you see intermediate commits with fn:doc, but not via xdmp:sql. Keep in mind that indexes have not been fully committed at that point, so I guess this is to be expected. Did you try running the sql in a separate call? Also keep in mind that collation is about comparing, not about storage nor fetching.. — grtjn, Nov 27 '18 at 19:00
Hm never heard of intermediate commits :). But yes, i tried running them in separat calls (different query console tabs) but the sql query will always return `Test`. "collation is about comparing, not about storage nor fetching" -> Thats what i thought and thats why dont unterstand why the sql view does not show the updated values :) — Wagner Michael, Nov 27 '18 at 19:21
It would appear that the column index is not being updated because "test" and "Test" are seen as equivalent when using the case-insensitive collation. If you use the `http://marklogic.com/collation/en` collation, then the column is updated and different values as returned, and if you then switch back to `http://marklogic.com/collation/en/S1` and insert the lower-case "test" doc first, it will return "test" even when you later insert the "Test" doc. — Mads Hansen, Nov 28 '18 at 03:05
@MadsHansen Yes, this is exactly what i am seeing. Is this the expected behavior or should i report it as a bug ? — Wagner Michael, Nov 28 '18 at 06:37
Just for completeness - the discussion around this scenario - mapping TDE to an ORM framework has been captured as a Request for Enhancement to be considered for a future version of MarkLogic. The RFE number is RFE-4231. If anyone other than Michael is interested in improvements in this area, please also open support cases and mention RFE-4231. — Lech Rzedzicki, Dec 21 '18 at 12:04

score 7 · Accepted Answer · answered Nov 28 '18 at 11:08

7

This is happening because the triple index stores unique values in a value dictionary for compression. Since "test" and "Test" are considered equal under the collation only one of the values will be stored in the value dictionary - in this case "Test".

By using a case and diacritic insensitive collation you are effectively saying that you don't care about case and diacritic differences. That means that you shouldn't care about the cases or diacritics used in the values returned from that column.

If you have a use case for retrieving the exact string as it appears in the original document, then add two columns - one that is case and diacritic sensitive, and one that uses the default codepoint collation. Search using the former, but return results from the latter.

answered Nov 28 '18 at 11:08

John Snelson

941
4
4

1

Thanks alot for the explanation! Not really the answer i wanted to hear, but technically this makes sense :) I don't think i can use your suggested solution with the duplicated columns though. We are using a Hibernate as a ORM on top of ODBC on top of TDE/MarkLogic. I don't think i can configure Hibernate/Jpa to search on column 'a' and return column 'b'. I will just try to avoid the sql calls and stick to native MarkLogic calls whenever i need case sensitivity :/ – Wagner Michael Nov 28 '18 at 12:13
1

One more thing i wanted to add: As far as i can remember, most other databases use collation only for comparing, but not for fetching data. This makes MarkLogic quite incompatible if you want to change from a case insensitive sql database to marklogic. More than that, i think it really makes no sense to return data "case insenstive". In my eyes, this should really be mentioned in their documentation. – Wagner Michael Nov 28 '18 at 12:57
Maybe it helps to keep in mind that you are running sql on 'views', not on 'tables' in MarkLogic.. – grtjn Nov 28 '18 at 19:45
Views in sql databases still return the "real" value(if not somehow otherwise calculated by the programmer). – Wagner Michael Nov 29 '18 at 09:12

MarkLogic sql S1 collation showing wrong values

1 Answers1