
I have an XML document that is UTF-8 encoded with no BOM and uses named entities in the element text. I use PowerShell to read the file, swap the named entities for numeric ones (as I don't always have access to the DTD or XSD files), and post the modified XML to a REST endpoint (which uses xdmp:document-insert).

For documents with accented characters in attribute values, "Invalid UTF-8 escape sequence" is reported in the log file. XML fragment below:

... in Brazil (<xref ref-type="bibr" rid="i0892-1016-51-1-72-BrazilMinistériodoMeioAmbienteMMAInstitutoChicoMendesdeConservaçãodaBiodiversidadeICMBio1">Brazil Minist&eacute;rio do Meio Ambiente, Instituto Chico Mendes de Conserva&ccedil;&atilde;o da Biodiversidade 2014</xref>). This species builds....

Apart from using PowerShell to swap these characters to their numeric entity form, is there any XQuery code to deal with this, or a setting in MarkLogic? The characters on this occasion are Western European, and the attributes are not used in indexes.

MarkLogic 8.0-6.7, Windows 10, PowerShell 5.1
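For reference, the PowerShell-side workaround mentioned above could look roughly like the sketch below. It is not part of my current script: the function name Convert-NonAsciiToNumericEntity is made up for illustration, and it only handles characters in the Basic Multilingual Plane (surrogate pairs would need extra work).

```powershell
# Hypothetical helper (not in the original script): rewrite every non-ASCII
# character as a decimal numeric character reference such as &#233;
function Convert-NonAsciiToNumericEntity ([string] $Xml)
{
    # The script block acts as a MatchEvaluator; each match is a single UTF-16 code unit
    return [regex]::Replace($Xml, '[^\u0000-\u007F]', { param($m) '&#{0};' -f [int][char]$m.Value })
}

# [char]0xE9 is 'é'; built this way so the example does not depend on the script file's encoding
$sample    = 'rid="Minist' + [char]0xE9 + 'rio"'
$converted = Convert-NonAsciiToNumericEntity $sample
Write-Output $converted   # rid="Minist&#233;rio"
```

After this step the payload is pure ASCII, so it no longer matters which single-byte character set the request body is sent in.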

Addition: Over the weekend I had a look around. On the MarkLogic side I moved a copy of the xdmp:get-request-body call outside the try-catch, and the error confirms your (Mads) suspicion. The PowerShell script decodes the file content as UTF-8 (see "Encode a string in UTF-8") but was clearly posting the text out in the default character set (1252?).
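The suspected mismatch can be reproduced in isolation. This is a minimal sketch, assuming the body really was being encoded as Windows-1252 (I have not confirmed the exact code page from the logs); the sample word is taken from the fragment above.

```powershell
# [char]0xE9 is 'é'; built this way so the example does not depend on the script file's encoding
$text = 'Minist' + [char]0xE9 + 'rio'

$utf8Bytes   = [System.Text.Encoding]::UTF8.GetBytes($text)
$cp1252Bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($text)

# 'é' is two bytes (C3 A9) in UTF-8 but a single byte (E9) in Windows-1252
$utf8Hex   = ($utf8Bytes   | ForEach-Object { $_.ToString('X2') }) -join ' '
$cp1252Hex = ($cp1252Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
Write-Output "UTF-8 : $utf8Hex"
Write-Output "1252  : $cp1252Hex"

# A strict UTF-8 decoder (throwOnInvalidBytes = $true) rejects the Windows-1252 bytes,
# which would match the kind of error the server reports
$strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
$rejected   = $false
try   { [void] $strictUtf8.GetString($cp1252Bytes) }
catch { $rejected = $true }
Write-Output "Rejected as UTF-8: $rejected"
```

A lone E9 byte followed by an ASCII letter is not a valid UTF-8 sequence, which would explain why only documents with accented characters fail.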

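function getBody ($FilePath)
{
    # Read the raw bytes and decode them explicitly as UTF-8
    $fileContentBinary = [System.IO.File]::ReadAllBytes($FilePath)
    $enc               = [System.Text.Encoding]::GetEncoding("UTF-8")
    $encodedContent    = $enc.GetString($fileContentBinary)
    # Swap named entities for numeric ones (elementReplace is defined elsewhere in the script)
    $encodedContent    = elementReplace $encodedContent
    return $encodedContent
}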

function sendXml ($MLHost, $LocalFilePath, $SUPPLIER_REF, $credentials, $xsltTRANSFORMLABEL)
{
    Add-Content $logfile -Value ('Posting file ' + $LocalFilePath + ' to ' + $MLHost + ' for supplier ' + $SUPPLIER_REF)
    $filename        = Split-Path $LocalFilePath -Leaf
    $EndpointAddress = 'http://{0}:######/nps3/article/upload/?supplier={1}&filename={2}&transform={3}' -f $MLHost, $SUPPLIER_REF, $filename, $xsltTRANSFORMLABEL
    $boundary        = [System.Guid]::NewGuid().ToString()
    $bodyText        = makeBody $LocalFilePath
    $contentType     = 'multipart/form-data; boundary={0}' -f $boundary
    try {
        Invoke-RestMethod -Uri $EndpointAddress -Method PUT -ContentType $contentType -Body $bodyText -Credential $credentials

        # All OK, so delete the file
        if (Test-Path $LocalFilePath) {
            Remove-Item $LocalFilePath
        }
    }
    catch {
        # Log the file name and the exception message
        Add-Content $logfile -Value ('A problem was encountered inserting "' + $filename + '" --> ' + $_.Exception.Message)
    }
}

I added $OutputEncoding = New-Object -typename System.Text.UTF8Encoding to the top of the PowerShell script (assuming it sets UTF-8 as the default character set for the session?) and also added a charset parameter to the $contentType statement:

$contentType = 'multipart/form-data; boundary={0} ; charset=utf-8' -f $boundary;

These changes appear to have corrected the issue. Does $OutputEncoding change the encoding for the entire session to UTF-8 if added at the top of the script?
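An alternative that avoids relying on session defaults would be to encode the body myself and hand Invoke-RestMethod a byte array, which, as far as I understand, it sends unchanged. A minimal sketch; ConvertTo-Utf8Bytes is a hypothetical helper, not something in my script. My reading of about_Preference_Variables is that $OutputEncoding governs text piped to external programs, so the charset parameter may be what actually fixed things, but I have not tested the two changes separately.

```powershell
# Hypothetical helper (not in the original script): encode request text as UTF-8 without a BOM
function ConvertTo-Utf8Bytes ([string] $Text)
{
    # UTF8Encoding($false) => no byte order mark is associated with this encoder
    $utf8NoBom = New-Object System.Text.UTF8Encoding($false)
    # The leading comma stops PowerShell from unrolling the byte[] on return
    return ,$utf8NoBom.GetBytes($Text)
}

# [char]0xE7 is 'ç' and [char]0xE3 is 'ã'; each becomes two bytes in UTF-8
$bodyBytes = ConvertTo-Utf8Bytes ('Conserva' + [char]0xE7 + [char]0xE3 + 'o')
Write-Output $bodyBytes.Length   # 13

# In sendXml the call would then become (not executed here, it needs a live endpoint):
# Invoke-RestMethod -Uri $EndpointAddress -Method PUT -ContentType $contentType -Body $bodyBytes -Credential $credentials
```

Keeping charset=utf-8 in $contentType still seems sensible, so the server is told how to decode the bytes it receives.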

Guy Yeates
  • Extra: Just loaded a copy of the document (UTF-8) using xdmp:document-load via the Qconsole without issue – Guy Yeates Mar 09 '18 at 15:38
  • Can you post the code from your Powershell script that shows how you are reading, replacing and writing, and how you invoked the REST endpoint? Since you are on Windows, it is very likely that a default encoding of CP-1252 was applied, mangling some multi-byte UTF-8 characters, but it is hard to diagnose if you don't share any of your code. – Mads Hansen Mar 10 '18 at 12:54
  • Updated with code snippets and a possible fix – Guy Yeates Mar 12 '18 at 12:31
