A given XML file with UTF-8 declared as the encoding does not pass xmllint. With the assumption that a non-UTF-8 character is causing the error, the following sed command is being run against the file:
sed 's/[^\x00-\x7F]//g' file.xml
Either the command is wrong, or non-UTF-8 characters are not the problem, as xmllint still fails after running the sed command. The first question is: does the sed regex appear correct?
= = = = =
Here is the output of xmllint:
$ xmllint file.xml
file.xml:35533: parser error : CData section not finished
<p class="imgcont"><img alt="Diets of 2013" src="h
<b>What You Eat: </b>Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35533: parser error : PCDATA invalid Char value 31
<b>What You Eat: </b>Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35588: parser error : Sequence ']]>' not allowed in content
as.people.com/2013/11/07/kerry-washington-pregnant-diet-green-smoothie-recipe/"]
^
= = = = =
UPDATE: In TextMate, on viewing the file, there is a character that is being shown as <US>. If that character is manually deleted from the file, the file then passes xmllint.
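The <US> glyph is the ASCII Unit Separator, byte 0x1F, which is decimal 31 and so matches the "PCDATA invalid Char value 31" error above. Because 0x1F lies inside the range \x00-\x7F, the sed expression leaves it in place and would instead strip legitimate non-ASCII text. Below is a minimal sketch of a filter that targets only the control characters XML 1.0 disallows, assuming the file is otherwise valid UTF-8. The function name strip_xml_control_chars and the output name file.clean.xml are hypothetical.

```shell
# Hypothetical helper: delete the C0 control bytes that XML 1.0 forbids,
# i.e. everything below 0x20 except tab (0x09), newline (0x0A) and
# carriage return (0x0D). LC_ALL=C makes tr work byte by byte, so
# multi-byte UTF-8 sequences pass through untouched.
strip_xml_control_chars() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Assumed usage: write a cleaned copy, then re-check it.
# strip_xml_control_chars < file.xml > file.clean.xml
# xmllint --noout file.clean.xml
```

This reads from standard input and writes to standard output, so the original file is left unmodified until the cleaned copy has been checked.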