I'm trying to manipulate a text file with song names. I want to clean up the data by replacing all spaces and tabs with +.

This is the code:

input = open('music.txt', 'r')
out = open("out.txt", "w")
for line in input:
    new_line = line.replace(" ", "+")
    new_line2 = new_line.replace("\t", "+")
    out.write(new_line2)
    #print(new_line2)
input.close()
out.close()

It gives me an error:

Traceback (most recent call last):
  File "music.py", line 3, in <module>
    for line in input:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2126: character maps to <undefined>

As music.txt is saved in UTF-8, I changed the first line to:

input = open('music.txt', 'r', encoding="utf8")

This gives another error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u039b' in position 21: character maps to <undefined>

I tried other things with out.write(), but none of them worked.

This is the raw data of music.txt: https://pastebin.com/FVsVinqW

I saved it in the Windows editor as a UTF-8 .txt file.

tripleee
Loewe8
  • Related: https://stackoverflow.com/questions/42070668/python-3-default-encoding-cp1252 – tripleee Mar 02 '21 at 14:49
  • @tripleee Windows uses Unicode since 1994. It's Python (or rather, Python devs that came from Linux) that causes issues. And if you consider that after 26 years developers still manage to write ASCII in a Unicode OS, Python development will have to deal with this problem for some time – Panagiotis Kanavos Mar 02 '21 at 14:52
  • @tripleee I'd say that problematic applications on Windows disappeared fairly quickly (definitely before 2005), as Unicode languages like Java and C# took over. Even VB6 was a Unicode language. It was the *non-Unicode* languages that caused issues, like C/C++, Delphi and Python. By 2005, such programs had either converted to Unicode or went out of business. For Windows, with multilingual users and a global audience a hard-coded `LC_ALL` was never an option. – Panagiotis Kanavos Mar 02 '21 at 14:57
  • Not sure what you are trying to say here; `LC_ALL` was certainly always per-user and you could change it between every system call if you wanted to, though that would obviously produce results which are unfit for human interaction. It's probably true that the mismatch between Microsoft's locale model and Python's is a source of friction, perhaps without either being obviously wrong (though my impression is that UTF-8 is still hard on Windows as of 2021, albeit perhaps not if you can upgrade all of your fleet to the latest supported OS). – tripleee Mar 02 '21 at 15:03
  • The first problem is that the file uses UTF-8 encoding, whereas the default encoding is system dependent. The second problem is, essentially, that your terminal doesn't know how to display the `Λ` character. Each of these problems is a common duplicate with a high-quality canonical, which I have linked accordingly. – Karl Knechtel Mar 16 '23 at 01:48

1 Answer

If your system's default encoding is not UTF-8 (as on legacy versions of Python 3 on Windows), you need to specify the encoding explicitly for both of the file handles you open.

with open('music.txt', 'r', encoding='utf-8') as infh,\
        open("out.txt", "w", encoding='utf-8') as outfh:
    for line in infh:
        line = line.replace(" ", "+").replace("\t", "+")
        outfh.write(line)

This demonstrates how you can use fewer temporary variables for the replacements; I also refactored to use a with context manager, and renamed the file handle variables to avoid shadowing the built-in input function.
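If you prefer to do both replacements in a single pass, a translation table is one option. This is only a sketch of an alternative, not a correction of the code above; the names `PLUS_TABLE`, `plusify` and `convert` are made up for illustration, and the file names are whatever you pass in.

```python
# Hypothetical helper names; only str.maketrans, str.translate and open are real APIs.

# Map both the space and the tab character to "+" in one table.
PLUS_TABLE = str.maketrans({" ": "+", "\t": "+"})


def plusify(line: str) -> str:
    """Return line with every space and tab replaced by "+"."""
    return line.translate(PLUS_TABLE)


def convert(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, replacing spaces and tabs on every line."""
    # Both handles get an explicit encoding, so the platform default is never used.
    with open(src_path, "r", encoding="utf-8") as infh, \
            open(dst_path, "w", encoding="utf-8") as outfh:
        for line in infh:
            outfh.write(plusify(line))
```

Calling `convert('music.txt', 'out.txt')` should then behave like the loop above. Newlines are left alone because only the space and the tab are in the table.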

Going forward, perhaps a better solution would be to upgrade your Python version; my understanding is that Python should now finally offer UTF-8 by default on Windows, too.
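To find out what your own interpreter falls back to when you omit `encoding`, something like the following sketch may help. The function name `describe_text_defaults` is made up for illustration; `sys.flags.utf8_mode` only exists on Python 3.7 and later, so the sketch reads it defensively.

```python
import locale
import sys


def describe_text_defaults() -> dict:
    """Collect the settings that influence open() when no encoding is passed."""
    return {
        "python_version": sys.version.split()[0],
        # The locale's preferred encoding, e.g. cp1252 on many Windows setups.
        "preferred_encoding": locale.getpreferredencoding(False),
        # 1 when UTF-8 mode is enabled (PYTHONUTF8=1 or -X utf8); None before 3.7.
        "utf8_mode": getattr(sys.flags, "utf8_mode", None),
    }


if __name__ == "__main__":
    for key, value in describe_text_defaults().items():
        print(f"{key}: {value}")
```

If the preferred encoding reported here is not UTF-8, passing `encoding='utf-8'` explicitly, as in the code above, remains the portable choice.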

tripleee
  • Oh, Stack Overflow renders tab as a space, too. Updated. – tripleee Mar 02 '21 at 15:21
  • FYI, UTF-8 is not the default for `open` on Windows as of Python 3.9, but since Python 3.7 setting the environment variable `PYTHONUTF8=1` makes it the default. See [PEP 540](https://www.python.org/dev/peps/pep-0540/) – Mark Tolonen Mar 02 '21 at 16:58