I found a list of most English words online, but the line breaks are Unix-style (LF) and the files are encoded in UTF-8. I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html

How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.

This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard

It should be:

bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard

How can I convert my files to this format? Note: there are 26 files (one per letter) with about 80,000 words in total, so the program should be fast.

I don't know where to start because I've never worked with Unicode. Thanks in advance!

Using rU as the parameter (as suggested), with this in my code:

with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
my_file.close()

I get this error:

Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    addWords('B Words')
  File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
    for line in my_file:
  File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>

Can anyone help me with this?

Rushy Panchal
  • you could possibly find this http://stackoverflow.com/questions/3891076/how-to-convert-windows-end-of-line-in-unix-end-of-line-cr-lf-to-lf helpful – dmi3y Dec 19 '12 at 14:49
  • Can't you make your program able to handle both types of line ending? – James M Dec 19 '12 at 14:49
  • @JamesMcLaughlin I already have a file with a list of words. In addition, I've never used unicode (as stated) so I don't know how to handle those types of endings. – Rushy Panchal Dec 19 '12 at 14:51
  • In unix use the sed command – Adrian Dec 19 '12 at 14:54

4 Answers

Instead of converting, you should be able to just open the file using Python's universal newline support:

f = open('words.txt', 'rU')

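(Note the U. In Python 3, text mode already uses universal newlines by default, and the 'U' flag was deprecated and then removed in Python 3.11, so a plain 'r' is enough there.)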
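
As a rough illustration of what universal newlines do in Python 3, here is a minimal sketch. The helper name `read_words` and the temporary demo file are made up for this example, and UTF-8 is assumed as the file encoding because that is what the question states.

```python
import os
import tempfile


def read_words(path, encoding="utf-8"):
    """Return one word per line, whatever line-ending style the file uses."""
    # newline=None (the default) enables universal newlines in text mode:
    # '\n', '\r\n' and '\r' are all translated to '\n' while reading.
    with open(path, encoding=encoding, newline=None) as f:
        return [line.rstrip("\n") for line in f if line.strip()]


# Demo: write raw bytes with mixed line endings, then read them back.
demo_path = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(demo_path, "wb") as f:
    f.write(b"backbite\nbackbiter\r\nbackbiters\rbackbites\n")

words = read_words(demo_path)
print(words)  # ['backbite', 'backbiter', 'backbiters', 'backbites']
```

With this approach the files never need to be converted; the program just reads them as they are.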

NPE

You can use the replace method of strings, like this:

txt.replace('\n', '\r\n')

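EDIT: in your case:

# newline='' turns off newline translation, so '\r\n' is written exactly once
# (in Python 3 text mode on Windows, '\n' would otherwise become '\r\n' again).
with open('input.txt', encoding='utf-8', newline='') as inp, \
        open('output.txt', 'w', encoding='utf-8', newline='') as out:
    txt = inp.read()
    # Normalise any existing CRLF first so it is not turned into '\r\r\n'.
    txt = txt.replace('\r\n', '\n').replace('\n', '\r\n')
    out.write(txt)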
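
If the goal is only to rewrite the line endings, the same replacement can also be done on raw bytes, which sidesteps decoding altogether. The following is a sketch under the assumption that the files are UTF-8 (where the bytes 0x0A and 0x0D only ever represent LF and CR); the function name `lf_to_crlf_bytes` and the demo file are hypothetical.

```python
import os
import tempfile


def lf_to_crlf_bytes(path):
    """Rewrite the file at `path` in place so every line ends with CRLF."""
    with open(path, "rb") as f:
        data = f.read()
    # Normalise existing CRLF to LF first so running this twice is harmless.
    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    with open(path, "wb") as f:
        f.write(data)


# Demo on a temporary file with mixed line endings.
demo_path = os.path.join(tempfile.mkdtemp(), "B Words.txt")
with open(demo_path, "wb") as f:
    f.write(b"backbite\nbackbiter\r\nbackbiters\n")

lf_to_crlf_bytes(demo_path)
with open(demo_path, "rb") as f:
    converted = f.read()
print(converted)  # b'backbite\r\nbackbiter\r\nbackbiters\r\n'
```

For the 26 files in the question, the function could simply be called once per file in a loop; at roughly 80,000 words in total, reading each file fully into memory should not be a concern.
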
dugres
  • If you want to change all the line endings in the same file without creating a new output file, look at my answer here: http://stackoverflow.com/a/43678795/3459910 – winklerrr Apr 28 '17 at 11:17

You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use Python's universal newlines mode.

The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8, but the file is opened in text mode without an explicit encoding. In Python 3, open() then falls back to the platform default (cp1252 on your Windows machine, as the traceback shows) and decodes the bytes into a string (i.e. a sequence of Unicode code points) while you iterate over the file, before str(line) is ever reached. However, some bytes in those files, such as 0x8d, are not defined in cp1252, and that causes the UnicodeDecodeError.

If you pass the encoding explicitly, i.e. open(my_file_name, encoding='utf-8'), you should no longer get the UnicodeDecodeError. Each line is then already a str, so the str(line) call is unnecessary. Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
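
To make the failure concrete, here is a small sketch that reproduces the same kind of error and then reads the file successfully. The demo file and its contents are invented for illustration (the second word is not from the actual list); it was chosen only because its first letter, U+010D, is encoded in UTF-8 as the bytes C4 8D, and 0x8d is the byte named in the traceback.

```python
import os
import tempfile

# Hypothetical demo file: UTF-8 text containing the byte 0x8d.
demo_path = os.path.join(tempfile.mkdtemp(), "B Words.txt")
with open(demo_path, "wb") as f:
    f.write("backbite\n\u010daj\n".encode("utf-8"))

# Decoding as cp1252 fails because 0x8d has no mapping in that code page.
try:
    with open(demo_path, encoding="cp1252") as f:
        f.read()
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding as UTF-8 works; each line is already a str.
new_words = []
with open(demo_path, encoding="utf-8") as my_file:
    for line in my_file:
        new_words.append(line.rstrip("\n"))

print(cp1252_failed)  # True
print(new_words)
```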

Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.

You can use the cereja package:

pip install cereja==1.2.0

import cereja
cereja.lf_to_crlf(dir_or_file_path)

or

cereja.lf_to_crlf(dir_or_file_path, ext_in=[".py", ".csv"])

You can substitute any other line-ending standard. See the filetools module.

Joab Leite