I have an XML document that is UTF-8 encoded with no BOM and uses named entities in the element text. I use PowerShell to read the file, swap the named entities for numeric ones (as I don't always have access to the DTD or XSD files), and post the modified XML to a REST endpoint (which uses xdmp:document-insert).
For documents with accented characters in attribute values, I get "Invalid UTF-8 escape sequence" reported in the log file. XML fragment below:
... in Brazil (<xref ref-type="bibr" rid="i0892-1016-51-1-72-BrazilMinistériodoMeioAmbienteMMAInstitutoChicoMendesdeConservaçãodaBiodiversidadeICMBio1">Brazil Ministério do Meio Ambiente, Instituto Chico Mendes de Conservação da Biodiversidade 2014</xref>). This species builds....
Apart from using PowerShell to swap these characters to their numeric entity form, is there any XQuery code to deal with this, or a setting in MarkLogic? The characters on this occasion are Western European, and the attributes are not used in indexes.
MarkLogic 8.0-6.7, Windows 10, PowerShell 5.1
Addition
Over the weekend I had a look around. On the MarkLogic side I pulled a copy of the xdmp:get-request-body call outside the try-catch, and the error confirms your (Mads) suspicion. I then looked at the PowerShell script: it reads the text content as UTF-8 (see "Encode a string in UTF-8") but was clearly posting the text out in the default character set (1252?).
function getBody ($FilePath)
{
    # Read the raw bytes and decode them explicitly as UTF-8
    $fileContentBinary = [System.IO.File]::ReadAllBytes($FilePath)
    $enc = [System.Text.Encoding]::GetEncoding("UTF-8")
    $encodedContent = $enc.GetString($fileContentBinary)
    # Swap named entities for numeric ones (elementReplace is not shown here)
    $encodedContent = elementReplace $encodedContent
    return $encodedContent
}
function sendXml ($MLHost, $LocalFilePath, $SUPPLIER_REF, $credentials, $xsltTRANSFORMLABEL)
{
    Add-Content $logfile -Value ('Posting file ' + $LocalFilePath + ' to ' + $MLHost + ' for supplier ' + $SUPPLIER_REF)
    $filename = (Split-Path $LocalFilePath -Leaf)
    $EndpointAddress = 'http://{0}:######/nps3/article/upload/?supplier={1}&filename={2}&transform={3}' -f $MLHost, $SUPPLIER_REF, $filename, $xsltTRANSFORMLABEL ;
    $boundary = [System.Guid]::NewGuid().ToString()
    # makeBody is not shown in this post
    $bodyText = makeBody $LocalFilePath
    $contentType = 'multipart/form-data; boundary={0}' -f $boundary;
    try {
        Invoke-RestMethod -Uri $EndpointAddress -Method PUT -ContentType $contentType -Body $bodyText -Credential $credentials
        # All OK, so delete the file
        if (Test-Path $LocalFilePath) {
            Remove-Item $LocalFilePath
        }
    }
    catch {
        # Log the file name and the exception message
        Add-Content $logfile -Value ('A problem was encountered inserting "' + (Split-Path $LocalFilePath -Leaf) + '" --> ' + $_.Exception.Message)
    }
}
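To illustrate why a Windows-1252 body would produce "Invalid UTF-8 escape sequence", here is a minimal sketch that compares the bytes for one accented word under both encodings. It assumes Windows PowerShell 5.1, and it assumes the default character set really was 1252, which I have not confirmed. The variable names are for illustration only and are not part of the script above.

```powershell
# Illustration only: compare the bytes for one accented word under both encodings.
# [char]0xE9 is 'é'; it is built from its code point so that the encoding of the
# script file itself does not matter.
$text = 'Minist' + [char]0xE9 + 'rio'

$utf8Bytes   = [System.Text.Encoding]::UTF8.GetBytes($text)
# Code page 1252 is an assumption about what the session default was
$cp1252Bytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes($text)

# Format each byte sequence as space-separated hex
$utf8Hex   = ($utf8Bytes   | ForEach-Object { $_.ToString('X2') }) -join ' '
$cp1252Hex = ($cp1252Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '

"UTF-8  : $utf8Hex"     # 4D 69 6E 69 73 74 C3 A9 72 69 6F
"CP1252 : $cp1252Hex"   # 4D 69 6E 69 73 74 E9 72 69 6F
```

In UTF-8 the byte E9 is the lead byte of a three-byte sequence and must be followed by two continuation bytes (80 to BF). In the 1252 output it is followed by 72 ('r'), so a strict UTF-8 decoder on the receiving side would reject it.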
I added $OutputEncoding = New-Object -TypeName System.Text.UTF8Encoding to the top of the PowerShell script (assuming it sets UTF-8 as the default character set for the session?) and also added a charset parameter to the $contentType statement:
    $contentType = 'multipart/form-data; boundary={0} ; charset=utf-8' -f $boundary;
These changes appear to have corrected the issue. Does setting $OutputEncoding at the top of the script change the encoding for the entire session to UTF-8?
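One possible way to take any session default out of the picture, sketched here and not tested against the endpoint above, would be to convert the body text to UTF-8 bytes explicitly and pass the byte array to -Body. The $bodyText value below is only a stand-in for the real multipart body, and $bodyBytes is a hypothetical variable name.

```powershell
# Stand-in for the multipart body text; [char]0xE7 is 'ç' and [char]0xE3 is 'ã'
$bodyText = 'Conserva' + [char]0xE7 + [char]0xE3 + 'o'

# Encode explicitly as UTF-8 so that no default encoding is involved
$bodyBytes = [System.Text.Encoding]::UTF8.GetBytes($bodyText)

# In sendXml the call would then look something like this (untested assumption):
# Invoke-RestMethod -Uri $EndpointAddress -Method PUT -ContentType $contentType -Body $bodyBytes -Credential $credentials
```

With a byte array the cmdlet has no string left to encode, so the charset parameter in $contentType would only describe the bytes to the server.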