My code reads an HTML file into a string and then lowercases everything except normal text and comments. The problem is that it also turns åäö into something VS Code can't recognise. As far as I can tell it is an encoding problem, but I can't find anything about it for Python 3, and the solutions I found for Python 2 didn't work. Any help is appreciated, and if you know how to improve the code, please tell me.

import os
import re

# Collect every .html file under the current directory.
text_list = []
for root, dirs, files in os.walk("."):
    for filename in files:
        if filename.endswith(".html"):
            text_list.append(os.path.join(root, filename))

for file in text_list:
    # No encoding is passed here, so the platform default encoding is used.
    with open(file, "r") as f:
        file_content = f.read()

    # Lowercase everything between "<" and ">"; the text in between is untouched.
    for code_string in re.findall(r"<.+?>", file_content):
        file_content = file_content.replace(code_string, code_string.lower())

    # Write before renaming, so the original path still exists.
    with open(file, "w") as f:
        f.write(file_content)
    os.rename(file, file.replace(" ", "_").lower())
Rick M.
  • You should open the file with an encoding, see https://stackoverflow.com/questions/147741/character-reading-from-file-in-python – dogman288 Jul 10 '20 at 09:00
  • Welcome to SO! Could you also add the text to your question so we can check the behavior? It is definitely a problem with encoding – Rick M. Jul 10 '20 at 09:02
  • Use e.g. `open(file, 'r+', encoding='utf-8')`. If you don't specify an encoding, Python will default to your system encoding, which may not be the same as the one used in the file. Your system encoding is given by `import locale; locale.getpreferredencoding(False)`. – ekhumoro Jul 10 '20 at 09:20

1 Answer

Open your file with `codecs` and specify a Unicode encoding such as UTF-8. Example:

import codecs
codecs.open('your_filename_here', encoding='utf-8', mode='w+')

Docs: Python Unicode Docs
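
To tie this back to the question, here is a minimal sketch of the read, lowercase, write round trip with an explicit encoding. It uses `pathlib` rather than `codecs`, since `read_text` and `write_text` accept an `encoding` argument in Python 3. The names `lowercase_tags` and `lowercase_tags_in_file` are made up for illustration, and UTF-8 is an assumption about how the HTML files were saved. Like the code in the question, it lowercases attribute values too.

```python
import re
from pathlib import Path

# Match HTML comments first so they can be passed through unchanged,
# then any other <...> span (a tag together with its attributes).
TAG_OR_COMMENT = re.compile(r"<!--.*?-->|<[^>]+>", re.DOTALL)


def lowercase_tags(html: str) -> str:
    """Lowercase tags; leave comments and the text between tags untouched."""
    def fix(match):
        span = match.group(0)
        return span if span.startswith("<!--") else span.lower()

    return TAG_OR_COMMENT.sub(fix, html)


def lowercase_tags_in_file(path: Path, encoding: str = "utf-8") -> None:
    """Read and write with the same explicit encoding (utf-8 is assumed)."""
    text = path.read_text(encoding=encoding)
    path.write_text(lowercase_tags(text), encoding=encoding)
```

If the files turn out to be saved in a different encoding (for example cp1252), passing that name as `encoding` should work the same way. The point is only that reading and writing use the same, explicitly named encoding instead of the system default.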

CFV