
I'm getting this error in my Python program: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

This question, "random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes", explains the issue.

The solution was to filter out certain bytes, but I'm confused about how to go about doing this.

Any help?

Edit: Sorry if I didn't give enough info about the problem. The string data comes from an external API query; I have no control over how the data is formatted.

y3di
4 Answers


As the answer to the linked question said, the XML standard defines a valid character as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Translating that into Python:

def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    # conditions ordered by presumed frequency
    return (
        0x20 <= codepoint <= 0xD7FF or
        codepoint in (0x9, 0xA, 0xD) or
        0xE000 <= codepoint <= 0xFFFD or
        0x10000 <= codepoint <= 0x10FFFF
        )

You can then use that function however you need to, e.g.

cleaned_string = ''.join(c for c in input_string if valid_xml_char_ordinal(c))
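A quick, self-contained demonstration (the input string is invented for illustration; the predicate is the one defined above):

```python
def valid_xml_char_ordinal(c):
    # same predicate as defined above
    codepoint = ord(c)
    return (0x20 <= codepoint <= 0xD7FF
            or codepoint in (0x9, 0xA, 0xD)
            or 0xE000 <= codepoint <= 0xFFFD
            or 0x10000 <= codepoint <= 0x10FFFF)

dirty = 'ab\x00c\x0bd'  # NUL (0x00) and vertical tab (0x0B) are illegal in XML 1.0
cleaned = ''.join(c for c in dirty if valid_xml_char_ordinal(c))
print(cleaned)  # → abcd
```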
John Machin
  • This looks like it might work. Unfortunately the data comes from an external API and the error is thrown very rarely, so I have no way to reproduce the conditions that caused the error. I have to wait until the circumstances are the same before I can test this solution. Thanks for the response though; I'm fairly new to Python, and understanding the implementation of your solution (and how it solves my problem) will give me a chance to understand the language better. – y3di Jan 05 '12 at 10:53
  • @y3di: While you are waiting for the error to happen again, you might like to change your code to trap the error and write the offending XML document to a log file. That way you can find out exactly what the illegal character(s) are, and decide whether stripping them out or substituting some other character(s) is more appropriate. In any case, your question (in effect: how to filter out illegal characters from an XML document) has been answered, and you should consider accepting it (click on the "tick" to the left of the answer). – John Machin Jan 05 '12 at 20:41
  • For attribution purposes, I'm just plugging here a gist with code for filtering these chars from strings, as I had to use this recently. Thanks for your answer. https://gist.github.com/rcalsaverini/6212717 – Rafael S. Calsaverini Aug 12 '13 at 16:46
  • I am finding that character '\xa9' is not accepted by lxml.builder.E but it is not filtered out by your function. `from lxml.builder import E` `E('a','\xa9')` ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters `''.join(c for c in '\xa9' if valid_xml_char_ordinal(c))` '\xa9' – DavidF Oct 24 '17 at 21:32
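John Machin's logging suggestion from the comments might be sketched like this (the wrapper and its names are illustrative, not lxml API; `build` stands in for whatever call raises the ValueError):

```python
import logging

def build_with_logging(build, text):
    # Hypothetical wrapper: if the XML builder rejects the input
    # (lxml raises ValueError for illegal characters), record the raw
    # data so the offending characters can be inspected later.
    try:
        return build(text)
    except ValueError:
        logging.exception("XML builder rejected input: %r", text)
        raise
```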

Another approach that's much faster than the answer above is to use regular expressions, like so:

re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', text)
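For instance (Python 3 shown; the input string is illustrative):

```python
import re

# everything NOT in the XML 1.0 Char production, as a negated character class
XML_ILLEGAL = ('[^\u0020-\uD7FF\u0009\u000A\u000D'
               '\uE000-\uFFFD\U00010000-\U0010FFFF]+')

text = 'ok\x00\x0b still ok \U0001F600'
cleaned = re.sub(XML_ILLEGAL, '', text)
print(cleaned)  # → 'ok still ok 😀' — NUL and vertical tab removed, emoji kept
```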

Compared to the answer above, it comes out more than 10x faster in my testing:

import timeit

func_test = """
def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    # conditions ordered by presumed frequency
    return (
        0x20 <= codepoint <= 0xD7FF or
        codepoint in (0x9, 0xA, 0xD) or
        0xE000 <= codepoint <= 0xFFFD or
        0x10000 <= codepoint <= 0x10FFFF
    );
''.join(c for c in r.content if valid_xml_char_ordinal(c))
"""

func_setup = """
import requests; 
r = requests.get("https://stackoverflow.com/questions/8733233/")
"""

regex_test = """re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', r.content)"""
regex_setup = """
import requests, re; 
r = requests.get("https://stackoverflow.com/questions/8733233/")
"""

func_test = timeit.Timer(func_test, setup=func_setup)
regex_test = timeit.Timer(regex_test, setup=regex_setup)

print(func_test.timeit(100))
print(regex_test.timeit(100))

Output:

> 2.63773989677
> 0.221401929855

So, making sense of that, what we're doing is downloading this webpage once (the page you're currently reading), then running the functional technique and the regex technique over its contents 100X each.

Using the functional method takes about 2.6 seconds.
Using the regex method takes about 0.2 seconds.
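If the substitution runs repeatedly, compiling the pattern once (standard `re` usage) avoids redoing that work on every call — a minor refinement, not part of the original answer:

```python
import re

ILLEGAL_XML_CHARS = re.compile(
    '[^\u0020-\uD7FF\u0009\u000A\u000D'
    '\uE000-\uFFFD\U00010000-\U0010FFFF]+')

def strip_illegal_xml_chars(text):
    # remove every run of characters outside the XML 1.0 Char production
    return ILLEGAL_XML_CHARS.sub('', text)

print(strip_illegal_xml_chars('a\x00b\x0cc'))  # → abc
```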


Update: As identified in the comments, the regex in this answer previously deleted some characters that should have been allowed in XML. These include anything in the Supplementary Multilingual Plane, which covers ancient scripts like cuneiform and hieroglyphics, and (weirdly) emoji.

The correct regex is now above. A quick test for this in the future is using re.DEBUG, which prints:

In [52]: re.compile(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', re.DEBUG)
max_repeat 1 4294967295
  in
    negate None
    range (32, 55295)
    literal 9
    literal 10
    literal 13
    range (57344, 65533)
    range (65536, 1114111)
Out[52]: re.compile(ur'[^ -\ud7ff\t\n\r\ue000-\ufffd\U00010000-\U0010ffff]+', re.DEBUG)

My apologies for the error. I can only offer that I found this answer elsewhere and put it in here. It was somebody else's error, but I propagated it. My sincere apologies to anybody this affected.

Update 2, 2017-12-12: I've learned from some OSX users that this code won't work on so-called narrow builds of Python, which apparently OSX sometimes has. You can check this by running import sys; sys.maxunicode. If it prints 65535, the code here won't work until you install a "wide build". See more about this here.
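Spelled out, that check looks like this (on Python 3.3+, `sys.maxunicode` is always 1114111, so only older narrow builds are affected):

```python
import sys

if sys.maxunicode == 0xFFFF:
    print("narrow build: the \\U0001....-\\U0010.... range above will not compile")
else:
    print("wide build, maxunicode =", hex(sys.maxunicode))
```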

mlissner
  • are you sure the regular expression is working as expected? I believe that the last two `F` letters are not being parsed as part of the `\u` escape. Try this in a python console: `print(u'\u10ffff')` And then this: `print(u'\u10ffzz')` – andres.riancho Jun 27 '17 at 13:29
  • You're right, and this definitely could have caused data erasure by removing unicode with codepoints between 65536 and 1114111. I've updated the answer. – mlissner Jul 03 '17 at 18:34
  • This throws an error: `re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', radars_string)` raises `sre_constants.error: bad character range` (from /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py) – user3508811 Oct 20 '18 at 00:37
  • I suspect you may have a "narrow" build of Python? Not positive, but it's easy to check if using a wide build fixes the issue: https://www.python.org/dev/peps/pep-0261/ – mlissner Oct 22 '18 at 18:53

I think this is harsh/overkill, and it seems painfully slow, but my program is still quick. After struggling to understand what was going wrong (even after I attempted to implement @John's cleaned_string implementation), I simply adapted his answer to purge anything that isn't printable ASCII, using the following (Python 2.7):

from curses import ascii

def clean(text):
    # replace anything that isn't printable ASCII with '?'
    return str(''.join(c if ascii.isprint(c) else '?' for c in text))
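A small demonstration of the replacement behavior (the function is repeated so the snippet runs standalone; the input is illustrative — note this substitutes `?` rather than deleting, unlike the answers above):

```python
from curses import ascii

def clean(text):
    # replace anything that isn't printable ASCII with '?'
    return str(''.join(c if ascii.isprint(c) else '?' for c in text))

print(clean('abc\x00\x1fdef'))  # → abc??def
```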

I'm not sure what I did wrong with the better option, but I just wanted to move on...

sage
  • you could use `bytes.translate()` to remove/replace bytes that are not in `string.printable`. Unlike @John Machin's solution, it doesn't work for arbitrary Unicode text. – jfs Sep 20 '14 at 07:28
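jfs's `bytes.translate()` suggestion might be sketched as follows (Python 3 `bytes` shown; as the comment notes, this works on raw bytes only, not arbitrary Unicode text):

```python
import string

# build a deletion table: every byte value not in string.printable is removed
PRINTABLE = set(string.printable.encode('ascii'))
DELETE_TABLE = bytes(b for b in range(256) if b not in PRINTABLE)

def clean_bytes(data):
    # translate(None, delete) keeps bytes unchanged and deletes the rest
    return data.translate(None, DELETE_TABLE)

print(clean_bytes(b'abc\x00\xffdef'))  # → b'abcdef'
```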

You may refer to the solution on this website:

https://mailman-mail5.webfaction.com/pipermail/lxml/2011-July/006090.html

That solution works for me. You may also want to consider John Machin's solution.

Good luck!