cv2 rename ä ö ü to ae oe ue

Question

In my code I'm converting multiple one-page PDFs to PNG. The conversion itself works well with cv2, but many of the PDF file names contain German umlauts (ä, ö, ü), and the resulting PNG names end up with garbled special characters.

Example: after converting the PDF (lösung_122.pdf) to PNG, the file is named "lÃ¶sung_122.png". It should be loesung_122.png.

I would like to replace all these characters (ä,ö,ü) in the document titles with ae, oe, ue.

How can I adjust my code to achieve this? What options do I have? Maybe there's a way to rename the documents (PDFs) before converting them?

from pdf2image import convert_from_path
import os
import cv2


if __name__ == '__main__':

    # Init
    dir_name = os.getcwd()
    path_pdf = dir_name + '/data/doc/October' #Folder containing all documents (PDF)
    save_path = dir_name + '/data/blanko/' #Folder with all converted doc (PNG)

    # Loop sub Folders:
    files = os.listdir(path_pdf)
    for pdf_file in files:

        # Check if PDF file
        if pdf_file[-3:] == 'pdf':
            images = convert_from_path(path_pdf + '/' + pdf_file, dpi=300, poppler_path='C:/Develop/poppler-0.68.0_x86/poppler-0.68.0/bin')

            # Save Images
            images[0].save(save_path + 'tmp.png', 'PNG')
            img = cv2.imread(save_path + 'tmp.png')
            cv2.imwrite(save_path + pdf_file[:-4] + '.png', img)

Any help appreciated

Regards

yes, maybe there is a way to rename files? I'm sure you've looked for ways... did nothing suit you? — Christoph Rackwitz, Oct 25 '21 at 15:27
Does this answer your question? [What is the best way to remove accents (normalize) in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-normalize-in-a-python-unicode-string) Note that most answers there convert "lösung" to "losung", so you need to do custom replacement beforehand, eg. with `str.replace()`. — lenz, Oct 25 '21 at 16:14
Thanks for the link! I'll keep it as a possibility for the future. For now I used Mark Tolonen's solution, which gets rid of the special characters. – Miyn, Oct 26 '21 at 07:58

Mark Tolonen · Accepted Answer · 2021-10-26T15:04:06.877

It's a bug in cv2.imwrite() that mangles the name you give it. You can try this to unmangle the name:

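    # Build the target path with the .png extension
    result = os.path.join(save_path, os.path.splitext(pdf_file)[0] + '.png')
    cv2.imwrite(result, img)
    # The file on disk has the UTF-8 bytes of the name read as ANSI ('mbcs', Windows only);
    # recompute that mangled name and rename the file back to the intended one
    os.rename(result.encode().decode('mbcs'), result)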

This renames the file from the mangled form back to the original. Note this doesn't remove the umlauts, since Windows can handle those characters in names.
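Since the rename keeps the umlauts, the ae/oe/ue replacement asked for in the question still has to be done separately. A minimal sketch, assuming a simple character table is enough: `UMLAUT_MAP`, `transliterate_umlauts` and `png_name_for` are hypothetical names made up for this example, and the uppercase and ß entries go beyond what the question lists.

```python
import os
import unicodedata

# Hypothetical mapping for illustration; extend or trim it as needed.
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_umlauts(name: str) -> str:
    """Replace German umlauts in a file name with their two-letter spellings."""
    # NFC first, so a decomposed "o" + combining diaeresis is also caught.
    return unicodedata.normalize("NFC", name).translate(UMLAUT_MAP)


def png_name_for(pdf_file: str) -> str:
    """Derive the PNG file name from the PDF file name."""
    stem, _ext = os.path.splitext(pdf_file)
    return transliterate_umlauts(stem) + ".png"


print(png_name_for("lösung_122.pdf"))
```

With something like this, the save line could use `os.path.join(save_path, png_name_for(pdf_file))` instead of the original name. If the whole path then contains only ASCII characters, cv2.imwrite() should have nothing left to mangle, so the rename step would probably not be needed.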

Note, though, that the rename can only restore characters represented in your local encoding, which is probably Windows-1252.
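To see where the mangled name comes from, the round trip can be reproduced in plain Python. This is only an illustrative sketch: it assumes the local encoding is Windows-1252 and uses the explicit 'cp1252' codec in place of 'mbcs' so that it runs on any platform. The variable names are made up for the example.

```python
original = "lösung_122.png"

# The name is encoded as UTF-8, but the bytes are then read as Windows-1252:
# "ö" is the two bytes C3 B6, which show up as the two characters "Ã" and "¶".
mangled = original.encode("utf-8").decode("cp1252")

# Reversing both steps recovers the intended name.
restored = mangled.encode("cp1252").decode("utf-8")

print(mangled)
print(restored)
```

The `result.encode().decode('mbcs')` expression in the snippet above computes the same kind of mangled name, which is why it can be passed to os.rename() as the source.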

edited Oct 26 '21 at 15:04

answered Oct 25 '21 at 18:13


Thank you so much! This helped a lot, I don't see any special characters anymore. :) – Miyn, Oct 26 '21 at 07:52
