I found a list of most English words on this website: http://dreamsteep.com/projects/the-english-open-word-list.html. The files use Unix-style line breaks (LF only) and are encoded in UTF-8.
How do I convert the line breaks to CRLF so I can iterate over the words? The program I will be using them in goes through the file line by line, so the words have to be one per line.
This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard
It should be:
bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard
How can I convert my files to this format? Note: there are 26 files (one per letter) with about 80,000 words in total, so the conversion should be fast.
I don't know where to start because I've never worked with Unicode. Thanks in advance!
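For reference, here is a minimal sketch of the kind of conversion I have in mind, assuming the files really are UTF-8. It works on raw bytes, so nothing has to be decoded (the LF byte never appears inside a multi-byte UTF-8 character). The names lf_to_crlf and convert_file are made up for illustration.

```python
import re

def lf_to_crlf(data):
    """Turn every bare LF in a bytes object into CRLF.

    Existing CRLF pairs are left alone, so running this twice is harmless.
    """
    return re.sub(rb'(?<!\r)\n', b'\r\n', data)

def convert_file(path):
    """Rewrite one file in place with CRLF line endings (hypothetical helper)."""
    # Binary mode avoids any decoding, so the default Windows codec is never involved.
    with open(path, 'rb') as source:
        data = source.read()
    with open(path, 'wb') as target:
        target.write(lf_to_crlf(data))
```

With only about 80,000 words across all 26 files, reading each file into memory in one go should not be a problem.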
Using 'rU' as the mode parameter (as suggested), with this in my code:
with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
my_file.close()
I get this error:
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    addWords('B Words')
  File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
    for line in my_file:
  File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>
Can anyone help me with this?
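The traceback goes through cp1252.py, so the file seems to be decoded with the Windows default codec rather than UTF-8. Here is a minimal sketch of what I think might avoid that, assuming the files really are UTF-8. The function name read_words is made up for illustration. In Python 3, text mode with the default newline setting already accepts both LF and CRLF line breaks, so converting the files may not even be necessary.

```python
def read_words(path, encoding='utf-8'):
    """Return the words in a one-word-per-line file as a list of str.

    The encoding is passed explicitly (assumed to be UTF-8 here) instead of
    relying on the platform default, which is cp1252 in the traceback above.
    """
    words = []
    with open(path, encoding=encoding) as word_file:
        for line in word_file:
            word = line.strip()  # drop the line break and surrounding whitespace
            if word:             # skip blank lines
                words.append(word)
    return words
```

The with block closes the file on exit, so no separate close() call is needed.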