
I'm trying to convert an Excel file containing Polish characters such as "ęśążćółń" to plain letters "esazcoln". First I managed to convert the xlsx file to txt, then:

f = open("PATH_TO_TXT_FILE")
r = f.read()
r.upper()
new_word = ""
for char in r:
    if char == "Ą":
        new_word += "A"
    elif char == "Ć":
        new_word += "C"
    elif char == "Ę":
        new_word += "E"
    elif char == "Ł":
        new_word += "L"
    elif char == "Ó":
        new_word += "O"
    elif char == "Ż"  "Ź":
        new_word += "Z"
    elif char == "Ź":
        new_word += "Z"
    elif char == "Ś":
        new_word += "S"
    else: 
        new_word += char

encoded_bytes = r.encode('utf-8', "replace")
decoded = encoded_bytes.decode(
    "cp1252", "replace")
print(decoded)

The file contains: asdżółć

Output: asdżółć

I'd like to receive: asdzolc

Is there anybody who can help me?

tripleee
Raqie
  • You mean `r = r.upper()`. But this is a really crude method; search for existing solutions involving Unicode normalization. – tripleee Aug 06 '20 at 15:21
  • The required string `asdzolc` should be in the `new_word` variable… You could simplify removing accents, see [this answer](https://stackoverflow.com/a/4160572/3439404). Then only the `Ł` and `ł` characters need to be replaced explicitly with `L` and `l`… – JosefZ Aug 08 '20 at 21:53

1 Answer


I can't find the Stack Overflow page from which I got the pattern/sub template, but this is the general idea:

#!/usr/bin/env python3
# coding: UTF-8

import re


# Polish letters with diacritics mapped to their plain ASCII counterparts.
mapping = {
    'Ą': 'A',
    'Ć': 'C',
    'Ę': 'E',
    'Ł': 'L',
    'Ń': 'N',
    'Ó': 'O',
    'Ż': 'Z',
    'Ź': 'Z',
    'Ś': 'S',

    'ą': 'a',
    'ć': 'c',
    'ę': 'e',
    'ł': 'l',
    'ń': 'n',
    'ó': 'o',
    'ż': 'z',
    'ź': 'z',
    'ś': 's',
}


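# One alternation matching any key; escape the keys when building the
# pattern, in case one of them is a regex metacharacter.
pattern = re.compile("|".join(re.escape(key) for key in mapping))


def replace_by_mapping(text):
    # Look up the matched character itself, not an escaped form of it.
    return pattern.sub(lambda m: mapping[m.group(0)], text)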


if __name__ == '__main__':
    # Assumes the file is saved as UTF-8; the platform default may differ.
    with open('polish_test.txt', 'r', encoding='utf-8') as f:
        contents = f.read()

    contents = replace_by_mapping(contents)
    print(contents)
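
An alternative along the lines suggested in the comments under the question is Unicode normalization. The following is only a minimal sketch, not a definitive implementation: the names `strip_polish_diacritics` and `STROKE_MAP` are made up for illustration. NFD decomposition splits letters like `ż` into a base letter plus a combining mark, which can then be dropped. `Ł` and `ł` have no such decomposition, so they are replaced explicitly first.

```python
import unicodedata

# NFD does not split Ł/ł into a base letter plus a combining mark,
# so these two are mapped explicitly (hypothetical helper table).
STROKE_MAP = str.maketrans({"Ł": "L", "ł": "l"})


def strip_polish_diacritics(text):
    """Return text with diacritics removed from its letters."""
    decomposed = unicodedata.normalize("NFD", text.translate(STROKE_MAP))
    # Drop the combining marks (acute, ogonek, dot above) left by NFD.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_polish_diacritics("asdżółć"))  # asdzolc
```

Note that this strips combining marks from any letter, not only Polish ones, so the explicit mapping above may be preferable if other accented characters have to be preserved.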
Matija8