I'm trying to create a little program that reads the contents of two stories, Alice in Wonderland & Moby Dick, and then counts how many times the word 'the' is found in each story.

However, I'm having trouble getting the Geany text editor to open the files. So far I've been creating and using my own small text files with no issues.

with open('alice_test.txt') as a_file:
    contents = a_file.readlines()

print(contents)

I get the following error:

Traceback (most recent call last):
  File "add_cats_dogs.py", line 50, in <module>
    print(contents)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python35-32\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2018' in position 279: character maps to <undefined>

As I said, no issues experienced with any small homemade text files.

Strangely enough, when I execute the above code in Python IDLE, I have no problems, even if I switch the text file's encoding between UTF-8 and ANSI.

I tried encoding the text file as UTF-8 and as ANSI. I also checked that Geany's default encoding is UTF-8 (and tried without using the default encoding), and tried both using and not using a fixed encoding when opening non-Unicode files.

I get the same error every time. The text file is from gutenberg.org; I tried another file from there and got the same issue.

I know it must be some sort of issue between Geany and the text file, but I can't figure out what.

EDIT: I found a sort of fix. Here is the text that was giving me problems: https://www.gutenberg.org/files/11/11-0.txt. Here is the text that I can use without problems: http://www.textfiles.com/etext/FICTION/alice13a.txt. The first one is encoded in UTF-8, the second in windows-1252. I would have expected the reverse, but for whatever reason the UTF-8 encoding seems to be causing the problem.

Cyanidies
  • I think you can find a solution here: http://stackoverflow.com/questions/14630288/unicodeencodeerror-charmap-codec-cant-encode-character-maps-to-undefined – Warager Oct 10 '16 at 15:20
  • Actually already had a look at some of those, unfortunately my Python skills are quite basic, so trying to implement what they suggest is very confusing and doesn't seem to help. – Cyanidies Oct 10 '16 at 17:24
  • There's no error when opening the file, it's just that the encoding used by your console (cp437) can't encode that character. Do you really need to print the text in order to read the files and count words? – Stop harming Monica Oct 10 '16 at 19:56
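
To illustrate the point in the last comment, here is a minimal sketch that counts a word without printing the file contents, so the console encoding never comes into play. The function name `count_word` and the default `encoding="utf-8"` are assumptions for illustration; pass `encoding="cp1252"` for a windows-1252 file.

```python
import re


def count_word(path, word="the", encoding="utf-8"):
    """Count whole-word, case-insensitive occurrences of `word` in a text file.

    `encoding` is an assumption about how the file was saved; adjust it
    if the file is not UTF-8.
    """
    with open(path, encoding=encoding) as f:
        text = f.read()
    # \b restricts matches to whole words, so "there" or "other" is not counted.
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))
```

Calling `count_word('alice_test.txt')` returns an integer, and printing an integer is safe on any console code page.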

1 Answer

Which OS are you using? Windows has similar problems. If that's what you're on, you can try running chcp 65001 in the console before your command. You can also add # encoding: utf-8 at the top of your .py file. I hope this helps, because I can't reproduce the encoding problem with a .txt file from gutenberg.org on my machine.

Warager
  • I use Windows, sorry, I should've specified. Unfortunately neither of those options works. Unless I'm misunderstanding, could you clarify how to run something on the console before executing a .py file? – Cyanidies Oct 11 '16 at 06:39
  • You can try using `chcp 65001` in the Windows console to switch your code page; chcp is a Windows command-line command for changing code pages. – Warager Oct 11 '16 at 06:56
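  • You can also tell the encode step to ignore unencodable characters in your print call. Since contents is a list from readlines(), join it first: `print(''.join(contents).encode('cp437', 'ignore').decode('cp437'))` – Warager Oct 11 '16 at 07:18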
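
The idea in the last comment can be generalized into a small helper, sketched here under stated assumptions: `console_safe` is a hypothetical name, and it falls back to `sys.stdout.encoding` (cp437 in the traceback above) when no encoding is given. It replaces characters the console cannot encode with `?` instead of raising `UnicodeEncodeError`.

```python
import sys


def console_safe(text, encoding=None):
    """Return `text` with characters the target encoding cannot represent
    replaced by '?'.

    `encoding` defaults to the console's encoding as reported by
    sys.stdout.encoding; "ascii" is used only if that is unavailable.
    """
    enc = encoding or sys.stdout.encoding or "ascii"
    # Encode with errors="replace", then decode back to str so print()
    # receives text rather than a bytes object.
    return text.encode(enc, errors="replace").decode(enc)
```

With the question's code, where `contents` is a list of lines, this would be used as `print(console_safe(''.join(contents)))`.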