I recently wrote a script to extract all bookmarks from a PDF and save them in a docx file. It works for about 90% of the files, but unfortunately some of them seem to have problems with Unicode.

I get the bookmarks in a list like this:

[[u'3. Mechatronik f\xfcr Doppelkupplungsgetriebe, Sicherungshalter B, Sicherung 14 auf Sicherungshalter C', 2],
[u'4. Geber f\xfcr Getriebeeingangsdrehzahl, Hydraulikdruckgeber 1 f\xfcr automatisches Getriebe, Magnetventil 2, Magnetventil \x04, Magnetventil 5', 2],
[u'5. W\xe4hlhebel, Schalter f\xfcr W\xe4hlhebel in P gesperrt, Magnet f\xfcr W\xe4hlhebelsperre', 2], 
[u'6. W\xe4hlhebel, Geber 2 f\xfcr Antriebswellendrehzahl, W\xe4hlhebel-Positionsanzeige', 2]]

When I try to run the function, I get the error:

ValueError('All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters',)

Code:

from docx import Document

list1 = [[u'3. Mechatronik f\xfcr Doppelkupplungsgetriebe, Sicherungshalter B, Sicherung 14 auf Sicherungshalter C', 2],
    [u'4. Geber f\xfcr Getriebeeingangsdrehzahl, Hydraulikdruckgeber 1 f\xfcr automatisches Getriebe, Magnetventil 2, Magnetventil \x04, Magnetventil 5', 2],
    [u'5. W\xe4hlhebel, Schalter f\xfcr W\xe4hlhebel in P gesperrt, Magnet f\xfcr W\xe4hlhebelsperre', 2],
    [u'6. W\xe4hlhebel, Geber 2 f\xfcr Antriebswellendrehzahl, W\xe4hlhebel-Positionsanzeige', 2]]

def save_docx(list1):
    document = Document('default.docx')
    file = open("Error_Log.txt", 'w')
    for i in list1:
        try:
            p = document.add_paragraph()
            p.add_run(i[0]).bold = True
        except Exception as e:
            file.write(repr(e) + '\n')
    file.close()
    document.save('Bookmarks.docx')

save_docx(list1)

I'm guessing the problem is the \x04, but I cannot figure out how to remove parts like this without ruining the whole document. I have tried different encodings and anything else I could find online, but nothing has worked so far.

Any help would be much appreciated!

TacashiX
  • Did you try this? `i[0].encode('utf-8')`, based on the discussion in http://stackoverflow.com/questions/5760936/handle-wrongly-encoded-character-in-python-unicode-string – Gerrit Verhaar Dec 07 '16 at 10:54
  • Yes, I tried decoding and encoding in various ways, e.g. `i[0].encode('ascii', 'ignore')`, but nothing worked. I also looked at libraries that might help, but no luck so far. – TacashiX Dec 07 '16 at 11:03
  • nice answer from @jackmorris. Could it be that after the encode the control character was still in the string? Thus the end result would be the same (error 'no control characters') – Gerrit Verhaar Dec 07 '16 at 11:23

1 Answer

Your assumption seems correct: \x04 is a control character, and your error message explicitly states that control characters aren't allowed.

You can filter out control characters from your strings before adding them to the document, which should fix your issue. This can be done with Python's unicodedata module, specifically unicodedata.category. The categories you want to exclude start with 'C' (from http://www.unicode.org/reports/tr44/#GC_Values_Table). That group covers the control characters (Cc) as well as format (Cf), surrogate (Cs), private-use (Co) and unassigned (Cn) code points.
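To see what unicodedata.category reports for the kinds of characters in the question, a small check like the one below may help. It is only a sketch for Python 3; the sample characters are my own picks (the \x04 and ü from the bookmarks, plus a zero-width space as an example of a format character).

```python
import unicodedata

# Sample characters chosen for illustration:
# \x04 (control), \xfc (the u-umlaut), \u200b (zero-width space), "A".
samples = ["\x04", "\xfc", "\u200b", "A"]

# Look up the two-letter general category of each sample.
categories = [unicodedata.category(ch) for ch in samples]

for ch, cat in zip(samples, categories):
    print(repr(ch), cat)
```

Only the categories whose first letter is 'C' would be dropped by the filter described above, so the umlauts in the bookmarks are left alone.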

The following should work, in place of your current add_run line:

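import unicodedata  # put this at the top of the script

# join() returns a string on both Python 2 and Python 3
line = u''.join(c for c in i[0] if unicodedata.category(c)[0] != 'C')
p.add_run(line).bold = True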
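If you want to reuse this, the filtering could be wrapped in a small helper. The sketch below is for Python 3 and does not touch python-docx at all. The name strip_control_chars and the keep parameter are made up for this example. Keeping tab, newline and carriage return is an assumption on my part: they fall in category Cc too, but XML 1.0 permits them, so you may not want to lose them.

```python
import unicodedata


def strip_control_chars(text, keep="\t\n\r"):
    """Drop characters whose Unicode category starts with 'C'.

    Hypothetical helper, not part of python-docx. Characters listed
    in `keep` are preserved even though they are in category Cc.
    """
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch)[0] != "C"
    )


# Part of one bookmark title from the question, including the stray \x04.
bookmark = "Magnetventil 2, Magnetventil \x04, Magnetventil 5"
cleaned = strip_control_chars(bookmark)
print(repr(cleaned))
```

Inside the loop you would then call p.add_run(strip_control_chars(i[0])). Note that the character is simply dropped, so the title reads "Magnetventil , Magnetventil 5" afterwards; whatever the PDF originally meant by \x04 cannot be recovered this way.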

As an aside, the typical way of including unicode characters in a unicode string is with \uXXXX, rather than \xXX (where XXXX is the hex of the unicode code point).

JPEG_
  • The category returned by unicodedata for `\x04` is `Cc`, not `C`. And I wouldn't say that the `\uXXXX` notation is the "typical" way, there is no difference between `\xXX`, `\u00XX` and `\U000000XX` for a code point below 256, and python itself always uses the shortest possible form, e.g `ascii("\U000000FF")` (or `repr(u"\U000000FF")` in python2) gives `\xff`. – mata Dec 07 '16 at 11:29
  • The category 'C' includes 'Cc', as well as 'Cf', which is a format control character. – JPEG_ Dec 07 '16 at 11:36
  • To the other point, 'typical' is probably the wrong word to use, however I think it makes more sense to specify unicode characters as code points rather than byte values, particularly when you exceed 256. You're right in saying that it makes no difference for low-valued code points. – JPEG_ Dec 07 '16 at 11:45
  • Amazing answer! Thank you very much! I'm quite new to Python and this would have taken me ages to figure out. – TacashiX Dec 07 '16 at 11:53
  • Yes, but you're comparing `unicodedata.category(c) != 'C'`, which will fail if the returned category is `Cc` and therefore filter nothing, you'd need to only compare the first character. And as the OP probably didn't type that string but copy its representation from somewhere, suggesting to change escape sequences seems a bit excessive. I prefer python's way of using the shortest possible form to escape a code point, it's just a different way of expressing numeric values. That the same escape form can be used to represent a byte value in a different context has nothing to do with unicode. – mata Dec 07 '16 at 12:17
  • Of course - thanks for the correction, updated my answer. – JPEG_ Dec 07 '16 at 13:11