Remove non UTF-8 characters from an XML file, using sed

Question

A given XML file with UTF-8 declared as the encoding does not pass xmllint. On the assumption that a non-UTF-8 character is causing the error, the following sed command is run against the file:

sed 's/[^\x00-\x7F]//g' file.xml

Either the command is wrong or non-UTF-8 characters are not the problem, because xmllint still fails after running sed. The first question is: does the sed regex appear correct?

= = = = =

Here is the output of xmllint:

$ xmllint file.xml
file.xml:35533: parser error : CData section not finished
<img alt="Diets of 2013" src="h
What You Eat: Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35533: parser error : PCDATA invalid Char value 31
What You Eat: Foods low in sugar and carbs and high in fat—80% of cal
^
file.xml:35588: parser error : Sequence ']]>' not allowed in content
as.people.com/2013/11/07/kerry-washington-pregnant-diet-green-smoothie-recipe/"]
^

= = = = =

UPDATE: In TextMate, on viewing the file, there is a character that is being shown as <US>. If that character is manually deleted from the file, the file then passes xmllint.

The character `<US>` is code point `\x1f`. What does xmllint say is the error? — Phylogenesis, Mar 10 '15 at 14:38
You want to have a look at [Why are “control” characters illegal in XML 1.0?](http://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0). — halfbit, Mar 10 '15 at 14:40
@halfbit: Thanks. Does it seem that the `sed` regex would need to be modified, to strip out control characters? — jerome, Mar 10 '15 at 14:53
Yes, according to [the spec](http://www.w3.org/TR/REC-xml/#charsets) the only characters between `\x00` and `\x1f` that are valid are `\x09`, `\x0a` and `\x0d`. — Phylogenesis, Mar 10 '15 at 14:58
sed works on characters not on bytes. If the encoding of the file is wrong, you've no idea what sed will see. You need a tool that works at the binary level, not the character level. — Michael Kay, Mar 10 '15 at 17:16
possible duplicate of [Using sed, how can a regular expression match Chinese characters?](http://stackoverflow.com/questions/23188189/using-sed-how-can-a-regular-expression-match-chinese-characters) — Paul Sweatte, Sep 17 '15 at 01:20

score 0 · Answer 1 · answered Aug 27 '20 at 20:20

It is somewhat hard to use sed to remove specific code points from the Unicode table.

If you need to target specific Unicode categories of characters, it makes more sense to work with Perl.

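perl -CSD -i -pe 's/(?![\t\n\r])\p{Cc}//g' file

will remove all control characters except TAB, CR and LF, editing the file in place. Note the option order: -i has to come before -e, because -e takes the next argument as the code to run. The -CSD switch makes Perl read and write the file as UTF-8, so \p{Cc} is tested against decoded characters rather than against individual bytes of a multi-byte sequence.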
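
If Perl is not an option, a byte-level filter may be enough for the case in the question, since the offending character is \x1f. The following is only a sketch: the function name strip_xml_ctrl and the sample text are made up for illustration. It deletes the bytes in the \x00-\x1f range that XML 1.0 does not allow and keeps TAB, LF and CR. Bytes below 0x80 never occur inside a multi-byte UTF-8 sequence, so other characters should pass through unchanged.

```shell
# Hypothetical helper: delete the C0 control bytes that XML 1.0 forbids.
# Octal 011 (TAB), 012 (LF) and 015 (CR) are deliberately left out of the set.
strip_xml_ctrl() {
  LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# Demo on a made-up line containing a US byte (octal 037).
printf 'high in fat\037 80%%\n' | strip_xml_ctrl
```

Against the file from the question this could be used as strip_xml_ctrl < file.xml > file.clean.xml, followed by xmllint --noout file.clean.xml to check the result. Unlike the Perl command, it does not touch the C1 controls (U+0080 to U+009F).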
