Changing utf-8 string to cp1251 (Python)

Question

I'm trying to convert an Excel file containing Polish characters such as "ęśążćółń" to plain letters "esazcoln". First I managed to convert the xlsx file to txt, then:

f = open("PATH_TO_TXT_FILE")
r = f.read()
r.upper()
new_word = ""
for char in r:
    if char == "Ą":
        new_word += "A"
    elif char == "Ć":
        new_word += "C"
    elif char == "Ę":
        new_word += "E"
    elif char == "Ł":
        new_word += "L"
    elif char == "Ó":
        new_word += "O"
    elif char == "Ż"  "Ź":
        new_word += "Z"
    elif char == "Ź":
        new_word += "Z"
    elif char == "Ś":
        new_word += "S"
    else: 
        new_word += char

encoded_bytes = r.encode('utf-8', "replace")
decoded = encoded_bytes.decode(
    "cp1252", "replace")
print(decoded)

The file contains: asdżółć

Output: asdÃ…Â¼ÃƒÂ³Ã…â€šÃ„â€¡

I'd like to receive: asdzolc

Is there anybody who can help me?

You mean `r = r.upper()`. But this is a really crude method; search for existing solutions involving Unicode normalization. — tripleee, Aug 06 '20 at 15:21
The required string `asdzolc` should be in the `new_word` variable… You could simplify removing accents, see [this answer](https://stackoverflow.com/a/4160572/3439404). Then only the `Ł` and `ł` characters need to be replaced explicitly with `L` and `l`… — JosefZ, Aug 08 '20 at 21:53
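
The normalization approach suggested in the comments could look roughly like the sketch below. It assumes the text is already a correctly decoded `str`; the function name `strip_polish_accents` and the `EXPLICIT` table are made up for illustration. NFD normalization splits letters such as `ż` into a base letter plus a combining mark, which is then dropped. `Ł` and `ł` have no such decomposition, so they are mapped by hand.

```python
import unicodedata

# Ł and ł do not decompose into a base letter plus a combining mark,
# so they are replaced explicitly (hypothetical helper table).
EXPLICIT = str.maketrans({"Ł": "L", "ł": "l"})


def strip_polish_accents(text):
    """Return text with Polish diacritics replaced by plain ASCII letters."""
    # Handle Ł/ł first, then decompose the remaining accented letters.
    decomposed = unicodedata.normalize("NFD", text.translate(EXPLICIT))
    # Drop the combining marks (ogonek, acute, dot above) left by NFD.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_polish_accents("asdżółć"))   # asdzolc
print(strip_polish_accents("ęśążćółń"))  # esazcoln
```

This also strips accents from non-Polish letters (for example `é`), which may or may not be wanted.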

score 0 · Answer 1 · answered Aug 06 '20 at 15:46

I can't find the Stack Overflow page from which I got the pattern/sub template, but this is the general idea:

#!/usr/bin/env python3
# coding: UTF-8

import re


mapping = {
    'Ą': 'A',
    'Ć': 'C',
    'Ę': 'E',
    'Ł': 'L',
    'Ń': 'N',
    'Ó': 'O',
    'Ż': 'Z',
    'Ź': 'Z',
    'Ś': 'S',

    'ą': 'a',
    'ć': 'c',
    'ę': 'e',
    'ł': 'l',
    'ń': 'n',
    'ó': 'o',
    'ż': 'z',
    'ź': 'z',
    'ś': 's',
}


# Escape the keys when building the pattern; the match itself is looked up unescaped.
pattern = re.compile("|".join(re.escape(key) for key in mapping))


def replace_by_mapping(text):
    return pattern.sub(lambda m: mapping[m.group(0)], text)


if __name__ == '__main__':
    # Read the file as UTF-8 explicitly; the platform default encoding may differ.
    with open('polish_test.txt', 'r', encoding='utf-8') as f:
        contents = f.read()

    contents = replace_by_mapping(contents)
    print(contents)
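
As for the garbled output in the question: it is consistent with the UTF-8 text being decoded as cp1252 twice. The first time would be `open()` called without an `encoding` argument (assuming the platform default was cp1252, as is common on Windows), the second the explicit `encode('utf-8')` / `decode("cp1252")` pair. A minimal sketch of that effect follows; the helper `misdecode` is invented for illustration.

```python
def misdecode(text):
    """Encode text as UTF-8, then wrongly decode those bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")


original = "asdżółć"

# Assumed first step: the file was read with a cp1252 default encoding.
once = misdecode(original)

# Second step: the question's explicit encode('utf-8') / decode('cp1252').
twice = misdecode(once)

print(once)
print(twice)

# Each wrong step can be undone by reversing it, which shows that nothing
# was lost here, only misinterpreted.
assert twice.encode("cp1252").decode("utf-8") == once
assert once.encode("cp1252").decode("utf-8") == original
```

Opening the file with `encoding="utf-8"` and dropping the encode/decode pair avoids both steps, so the replacement logic sees the real Polish letters.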
