
I have a large string of a novel that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.

from urllib import request

# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()

This gives me the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>

First, I tried to remove the BOM at the beginning of the file:

# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')

but I get the same error, just with a different position number:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>

If I look in that area I can't find the offending characters, so I don't know what to remove:

raw[7850:7900]

just prints out:

'   BALLENA,             Spanish.\r\n     PEKEE-NUEE-'

which doesn't look like it would be a problem.

So then I tried to skip the bad lines with a try statement:

file = open('corpora/canon_texts/' + 'test', 'w')
try:
    file.write(raw)
except UnicodeEncodeError:
    pass
file.close()

but this skips the entire text, giving me a file of 0 size.

How can I fix this?

EDIT:

A couple of people have noted that '\ufeff' is the utf-16 BOM. I tried switching to utf-16:

# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')

But I can't even download the data before I get this error:

UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data

SECOND EDIT:

I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string because that includes BOM, but then I'm back to this error when I try to save it:

UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
  • Because `\ufeff` is the BOM for utf-16 and you're trying to decode it as utf-8. – TemporalWolf Apr 08 '17 at 10:56
  • Possible duplicate of [u'\ufeff' in Python string](http://stackoverflow.com/questions/17912307/u-ufeff-in-python-string) – TemporalWolf Apr 08 '17 at 11:05
  • If you want to see what characters are causing the exception, try `ascii(raw[7850:7900])`. There might be an unusual white-space character or a soft hyphen or the like, which you can't see with the standard `repr()` form. – lenz Apr 08 '17 at 22:54
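A minimal sketch of lenz's suggestion, using a hypothetical string containing a no-break space (`'\xa0'`) standing in for the invisible character: `ascii()` escapes every non-ASCII character, so characters that `print()` renders as ordinary whitespace become visible.

```python
# '\xa0' is a no-break space: print() shows it as a normal space,
# but ascii() escapes it so you can see exactly what is there.
s = 'BALLENA,\xa0Spanish'
print(s)         # looks like 'BALLENA, Spanish'
print(ascii(s))  # shows 'BALLENA,\xa0Spanish'
```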

3 Answers


Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:

raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
    outfile.write(raw)

This is the only reliable way to write to disk exactly what you downloaded.

Sooner or later you'll want to read in the file and work with it, so let's consider your error. You didn't provide a full stack trace (always a bad idea), but your error is during encoding, not decoding. The decoding step succeeded.

The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you're getting depends on your location, OS, and probably the tides and weather forecast. In short: Specify the encoding.

text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
    outfile.write(text)
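When you read the file back later, specify the encoding again. A small self-contained sketch (the path and sample text are placeholders): reading with `'utf-8-sig'` strips a leading BOM if one slipped into the saved text, which is why lenz suggests it for the input side as well.

```python
import os
import tempfile

# Hypothetical round trip in a temp directory.
path = os.path.join(tempfile.mkdtemp(), 'test')

# Simulate a BOM left at the start of the decoded text.
with open(path, 'w', encoding='utf-8') as outfile:
    outfile.write('\ufeffCall me Ishmael.')

# 'utf-8-sig' removes a leading BOM during decoding, if present.
with open(path, encoding='utf-8-sig') as infile:
    text = infile.read()

print(text)  # 'Call me Ishmael.' -- no BOM
```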
  • Yeah, that's it: the default for the output encoding in the OP's environment is apparently some 8-bit encoding, as the error message ('charmap' codec) suggests, which can't handle some characters used. The input encoding should be 'utf-8-sig' though, since the BOM is not something you want to have in a decoded string. – lenz Apr 08 '17 at 22:48
  • Saving it to disk using your approach works, but when I try to read it with your code it wipes the file and the variable text is just an empty string – jss367 Apr 13 '17 at 04:43
  • I think you are confused about the process of reading and writing files. Opening a file in "w" or "wb" mode will indeed immediately wipe out its prior contents. To read a file, open it in "r" mode (or with no mode, which defaults to "r"); read the Python tutorial on [input and output](https://docs.python.org/3/tutorial/inputoutput.html) for the rest. Anyway _this_ question was about how to download a file and save it to disk. Use Notepad (or another editor) to confirm that the above code works and the file is downloaded. If you then have problems reading your files, ask a new question. – alexis Apr 13 '17 at 12:02

U+FEFF is for UTF-16. Try that instead.


.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:

The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.

Probably the safest option is

decode("utf8", errors='backslashreplace')

which will escape encoding errors with a backslash, so you have a record of what failed to decode.

Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
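A short sketch of what that looks like, using made-up bytes rather than the actual Gutenberg file: byte 0x96 (a Windows-1252 en dash) is not valid UTF-8, so it gets escaped instead of raising.

```python
# 0x96 is not a valid UTF-8 byte sequence on its own; with
# errors='backslashreplace' it is kept as a visible escape.
data = b'BALLENA \x96 Spanish'
text = data.decode('utf8', errors='backslashreplace')
print(text)  # BALLENA \x96 Spanish
```

Searching the result for backslashes then locates every byte that failed to decode.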

What is strange about this text is that the website says it is in utf-8, but '\ufeff' is the BOM for utf-16. Decoding in utf-16, it looks like you're just having trouble with the very last byte, 0x0a (a newline), which can probably safely be dropped with

decode("utf-16", errors='ignore')
  • The character `'\ufeff'` is the BOM, but since this is a decoded string, you can't say it is the BOM for UTF-16. (The BOM for UTF-16 [big endian] is, if you want, the byte sequence `b'\xfe\xff'`.) The BOM shouldn't be present in a decoded string, though, so the correct input encoding is probably `utf-8-sig`. If the input was UTF-16, as you suspect, then decoding with UTF-8 would have immediately failed, since `b'\xfe\xff'` is not a valid UTF-8 byte sequence. – lenz Apr 08 '17 at 22:38