How do you write text extracted from a PDF (using textract) to docx files in Python?

Question

I have several articles in a single PDF file, and I am trying to split them up and write each one to its own docx file. I managed to separate them using a regex, but when I try to write them to docx files, I get this error: ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters.

My code is as follows:

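import os
import re
import time

import docx
import textract

my_path = "/path/to/pdf"

# textract returns bytes, so decode to str before applying the regex
newpath = textract.process(my_path)
newpath2 = newpath.decode("UTF-8")

# Capture each article body between "<N> words" and "Document <id>"
result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)

save_path = "/path/to/write/docx/files/"

for each in result:
    # Use the current timestamp as the file name
    timestamp = str(time.time())
    finalpath = os.path.join(save_path, timestamp) + ".docx"
    mydoc = docx.Document()
    mydoc.add_paragraph(each)  # fails with the ValueError quoted above
    mydoc.save(finalpath)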

So, why not remove them? `.add_paragraph(remove_control_characters(each.replace('\x00','')))` with the `remove_control_characters` from [this answer](https://stackoverflow.com/a/19016117/3832970). — Wiktor Stribiżew, Nov 26 '20 at 11:02
@WiktorStribiżew Thanks a lot. Do I need to import something before doing this? It's throwing this error: NameError: name 'remove_control_characters' is not defined — Eisenheim, Nov 26 '20 at 11:52
I shared the link to the question with that function definition. Just copy and paste that function to your code. — Wiktor Stribiżew, Nov 26 '20 at 11:58
@WiktorStribiżew Accepted! Quick question though. I will be using more PDF files, which may contain other Unicode characters. How do I make this code handle them without explicitly telling it which characters to replace? — Eisenheim, Nov 26 '20 at 12:30
I understand it is not a problem if the text has Unicode chars in it. You just need to remove those that make it incompatible with some parser you are using. — Wiktor Stribiżew, Nov 26 '20 at 12:53

score 2 · Accepted Answer · answered Nov 26 '20 at 12:17

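The error means the extracted text contains null bytes or control characters, which cannot be stored in the XML that makes up a docx file. Strip them before adding the paragraph:

.add_paragraph(remove_control_characters(each.replace('\x00','')))

The remove_control_characters function can be borrowed from the "Removing control characters from a string in python" thread (https://stackoverflow.com/a/19016117/3832970). It drops every character whose Unicode category starts with "C" (control, format, surrogate, private-use, unassigned) and keeps everything else, so ordinary non-ASCII text is left alone. Note that tabs and newlines are control characters too, so they are removed as well.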

Code snippet:

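import os
import re
import time
import unicodedata

import docx
import textract

def remove_control_characters(s):
    # Keep only characters whose Unicode category does not start with "C"
    return "".join(ch for ch in s if unicodedata.category(ch)[0] != "C")

my_path = "/path/to/pdf"

# textract returns bytes, so decode to str before applying the regex
newpath = textract.process(my_path)
newpath2 = newpath.decode("UTF-8")

result = re.findall(r'\d+ words(.*?)Document \w+', newpath2, re.DOTALL)

save_path = "/path/to/write/docx/files/"

for each in result:
    # Use the current timestamp as the file name
    timestamp = str(time.time())
    finalpath = os.path.join(save_path, timestamp) + ".docx"
    mydoc = docx.Document()
    # '\x00' is itself category Cc, so the replace() is redundant but harmless
    mydoc.add_paragraph(remove_control_characters(each.replace('\x00', '')))
    mydoc.save(finalpath)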
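
Because remove_control_characters filters on Unicode category, it also strips tabs and newlines, which flattens each article into one long line. If the line breaks matter, a narrower filter is one option. The following is only a sketch, assuming the sole requirement is that the string stays inside the character range XML 1.0 allows (which is what the error message complains about). The name strip_xml_illegal is hypothetical and not part of python-docx or textract.

```python
import re

# Characters that XML 1.0 does not allow, i.e. everything outside
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_XML_ILLEGAL = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_xml_illegal(text):
    """Hypothetical helper: drop characters that cannot appear in XML 1.0."""
    return _XML_ILLEGAL.sub("", text)

# Sample with a NUL byte, a form feed and a unit separator mixed into
# otherwise normal text (including a non-ASCII letter, a newline and a tab)
sample = "Caf\u00e9\x00 menu\x0c\nLine two\twith tab\x1f"
cleaned = strip_xml_illegal(sample)
print(repr(cleaned))
```

In the loop above this would be used as mydoc.add_paragraph(strip_xml_illegal(each)). Non-ASCII letters, tabs and newlines survive, and only the characters XML rejects are dropped. How the remaining newlines are rendered is up to python-docx, so it is worth checking the output on one real article first.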
