I am creating Word documents based on a user's input in a form. However, when the user enters a Unicode control character and I try to build a Word file from that input with the python-docx package, this error occurs:

File "src\lxml\apihelpers.pxi", line 1439, in lxml.etree._utf8
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I managed to tackle this issue by checking the form for invalid XML characters before each request (I have many forms where this problem might occur) and removing them from the fields. I then create a new ImmutableMultiDict and fill it with the cleaned text.

from docx import Document
from docx.shared import Inches
from flask import Flask, render_template_string, request
from werkzeug.datastructures import ImmutableMultiDict

def valid_xml_char_ordinal(c):
    codepoint = ord(c)
    return (0x20 <= codepoint <= 0xD7FF or codepoint in (0x9, 0xA, 0xD) or
            0xE000 <= codepoint <= 0xFFFD or 0x10000 <= codepoint <= 0x10FFFF)

app = Flask(__name__)

@app.before_request
def before_request():
    if 'check_form_xml_validity' in request.form:
        tuple_list = []
        for field_name in request.form:
            all_field_values = request.form.getlist(field_name)
            for field_value in all_field_values:
                cleaned_field_value = ''.join(c for c in field_value if valid_xml_char_ordinal(c))
                tuple_list.append((field_name, cleaned_field_value))
        request.form = ImmutableMultiDict(tuple_list)

@app.route('/', methods=['GET', 'POST'])
def form_test():
    if request.method == 'GET':
        x = '\x0c'  # looks empty when rendered, but holds a control character (form feed, U+000C)
        return render_template_string(
            """<form action="{{ url_for('form_test') }}" method="post">
                <input name="some_field" value="{{x}}"><br>
                check the xml validity of this form? <br>
                <input type="checkbox" checked name="check_form_xml_validity"><br>
                <button>submit</button>
            </form>""",
            x=x)
    else:
        doc = Document()
        p = doc.add_paragraph(request.form['some_field'])
        return 'yay'

This method works. However, it seems unlikely that I'm the only one with this problem, yet I couldn't find any clean solutions. So the question is: should I really be solving it this way? It's pretty tedious, and it feels like I'm overlooking some Flask or python-docx setting or argument that would solve this issue.

The example is fully functional. If the checkbox is checked, before_request cleans the form fields. If it is not checked, the fields are left untouched and the server error shown above occurs.

The control character is: U+000C : <control-000C> (FORM FEED [FF])
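
To find out which characters in a field trigger the error, a small helper along these lines may be useful. It is only a sketch: `find_invalid_xml_chars` is a hypothetical name, and the check reuses the XML 1.0 character ranges from `valid_xml_char_ordinal` above, on the assumption that lxml rejects the same set.

```python
def is_valid_xml_char(c):
    """Return True if c falls in the XML 1.0 character ranges used above."""
    codepoint = ord(c)
    return (0x20 <= codepoint <= 0xD7FF or codepoint in (0x9, 0xA, 0xD) or
            0xE000 <= codepoint <= 0xFFFD or 0x10000 <= codepoint <= 0x10FFFF)


def find_invalid_xml_chars(text):
    """Return (index, 'U+XXXX') pairs for every character outside those ranges."""
    return [(i, 'U+{:04X}'.format(ord(c)))
            for i, c in enumerate(text) if not is_valid_xml_char(c)]


# A form feed hidden between two words is reported with its position.
print(find_invalid_xml_chars('page one\x0cpage two'))  # [(8, 'U+000C')]
```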

Joost
  • Doesn't `decode("utf-8","ignore")` do what you want? – pguardiario Aug 19 '18 at 23:08
  • The `field_value` is already a string, so I can't decode it any further. Apart from that, it's not really an encoding error, since the control characters are still valid in the encoding. But lxml doesn't allow them, and I want them ignored on a somewhat global level such as in the example. – Joost Aug 22 '18 at 10:35
  • What do you mean by "clean"? I don't think `flask` nor `python-docx` has something to remove the control characters for you. You will need to add something that does it, and that's ultimately just going to use `unicodedata` most likely. – Luis Orduz Aug 26 '18 at 15:35

1 Answer

Unicode contains many control characters, so listing them by hand is impractical. Instead, remove every character whose Unicode general category starts with "C" (the "Other" group, which includes the control characters). To do that, I recommend unicodedata.category from the unicodedata module.

See code below:

import unicodedata


def remove_control_chars(s):
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")
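
One thing to watch with the category-based filter: tab, line feed and carriage return are themselves in category Cc, so they are removed too, even though XML allows them. The comparison below is a rough sketch of the difference. `remove_invalid_xml_chars` is a hypothetical name that wraps the range check from the question.

```python
import unicodedata


def remove_control_chars(s):
    # Drops every character in a "C" category, including \t, \n and \r.
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")


def remove_invalid_xml_chars(s):
    # Drops only characters outside the XML 1.0 ranges used in the question.
    def ok(ch):
        cp = ord(ch)
        return (0x20 <= cp <= 0xD7FF or cp in (0x9, 0xA, 0xD) or
                0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)
    return "".join(ch for ch in s if ok(ch))


sample = "line 1\nline 2\tend\x0c"
print(repr(remove_control_chars(sample)))      # 'line 1line 2end'
print(repr(remove_invalid_xml_chars(sample)))  # 'line 1\nline 2\tend'
```

If multi-line form fields matter, the range-based version is probably the safer choice, since it keeps line breaks intact.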
Andriy Ivaneyko
  • Thanks for your answer. Removing the control characters is not the problem though, even though your solution is cleaner. I prefer a solution at the flask or the python-docx level. – Joost Aug 22 '18 at 11:00
  • @Joost I'm somewhat sure that this is up to you to solve, because your app is _the_ boundary: XML doesn't allow control characters (see e.g. https://stackoverflow.com/questions/404107/why-are-control-characters-illegal-in-xml-1-0), while Flask doesn't consider them special. Stripping control characters is not any sort of reasonably-default behavior, it's a strategy that you chose (as opposed to raising errors, replacing with question marks, etc.), so it is up to you to implement it.
  • Thanks Matejcik, that is the best explanation. I will just use my current solution displayed in the question, so control characters can not enter my app in the first place. – Joost Sep 03 '18 at 12:37
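
The strategies mentioned in the comments (stripping, replacing with question marks, raising errors) could be sketched roughly as follows. This is an illustration, not something Flask or python-docx provides: `sanitize` and its `strategy` parameter are hypothetical, and the regular expression assumes the same XML 1.0 character ranges as the question's code.

```python
import re

# Matches any character outside the XML 1.0 ranges used in the question.
_INVALID_XML = re.compile(
    '[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]')


def sanitize(text, strategy='strip', replacement='?'):
    """Handle XML-invalid characters according to the chosen strategy."""
    if strategy == 'strip':
        return _INVALID_XML.sub('', text)
    if strategy == 'replace':
        return _INVALID_XML.sub(replacement, text)
    if strategy == 'raise':
        match = _INVALID_XML.search(text)
        if match:
            raise ValueError('invalid XML character U+{:04X} at index {}'.format(
                ord(match.group()), match.start()))
        return text
    raise ValueError('unknown strategy: {!r}'.format(strategy))


print(sanitize('a\x0cb'))             # ab
print(sanitize('a\x0cb', 'replace'))  # a?b
```

Whichever strategy is chosen, calling it in one place (such as the before_request hook in the question) keeps the decision out of the individual views.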