Edit your question to show the version of Python you are using. It is not possible to guess the version from your code, and whether you are using Python 3.x or 2.x matters a lot. The following remarks assume Python 2.x.
You already seem to have determined that you have UTF-8 encoded text. Try the_text.decode('utf8'). Note decode, NOT encode.
If decoding with UTF-8 does not raise UnicodeDecodeError and your text is not trivially short, then it is very close to certain that UTF-8 is the correct encoding.
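Here is a minimal sketch of that check. The helper name try_utf8 and the two sample byte strings are made up for illustration; the code sticks to byte literals so it should run on Python 2.7 and 3 alike.

```python
def try_utf8(raw_bytes):
    """Return the decoded text, or None if raw_bytes is not valid UTF-8.

    try_utf8 is a hypothetical helper, not something from the question.
    """
    try:
        return raw_bytes.decode('utf8')
    except UnicodeDecodeError:
        return None

# Sample data (assumed): the same text encoded two different ways.
good = b'caf\xc3\xa9 \xe2\x80\x94 ok'   # UTF-8 bytes for u'caf\xe9 \u2014 ok'
bad = b'caf\xe9 \x97 ok'                # the same text encoded as cp1252

decoded_good = try_utf8(good)   # decodes cleanly
decoded_bad = try_utf8(bad)     # not valid UTF-8, so None
```

Non-ASCII text in some other 8-bit encoding rarely happens to form valid UTF-8 sequences, which is why a clean decode of a reasonably long text is such strong evidence.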
If the above does not work, show us the result of print repr(the_text).
Note that it is counter-productive to check whether the file is encoded in ASCII -- ASCII is a subset of UTF-8, so any ASCII file also decodes correctly as UTF-8. Leaving some data as str objects and the rest as unicode is messy in Python 2.x and won't work in Python 3.x.
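To see the subset relationship for yourself, a small sketch (variable names are illustrative only) decodes every ASCII byte value with both codecs and compares the results:

```python
# All 128 ASCII byte values, built in a way that works on Python 2.7 and 3.
ascii_bytes = bytes(bytearray(range(128)))

as_ascii = ascii_bytes.decode('ascii')
as_utf8 = ascii_bytes.decode('utf8')

# Identical results: a separate "is it ASCII?" test tells you nothing new.
same_result = (as_ascii == as_utf8)
print(same_result)
```

So you can decode everything as UTF-8 and keep all of your text as unicode objects.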
In any case, your first function doesn't do what you think it does; it returns False for any input string whose length is 2 or more. Please consider unit-testing functions as you write them; it makes debugging much faster later on.
Note that latin1 and iso-8859-1 are the same encoding. Because latin1 maps every one of the 256 possible byte values to the first 256 Unicode codepoints, in the same order, it is impossible for text.decode('latin1') to raise UnicodeDecodeError. "No error" in this case has exactly zero diagnostic value.
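A short sketch of why a successful latin1 decode proves nothing (the variable names are mine, not from the question): every possible byte decodes, and UTF-8 bytes decoded as latin1 silently turn into the wrong characters.

```python
import codecs

# Both spellings resolve to the same codec.
same_codec = codecs.lookup('latin1').name == codecs.lookup('iso-8859-1').name

# Every possible byte value decodes, each to the codepoint with the same number.
all_bytes = bytes(bytearray(range(256)))
decoded = all_bytes.decode('latin1')
code_points = [ord(ch) for ch in decoded]

# UTF-8 bytes for an em dash decoded with the wrong codec:
# no exception, just three wrong characters instead of one right one.
mojibake = b'\xe2\x80\x94'.decode('latin1')
```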
Update in response to this comment from OP:
I use Python 2.7. If I use text.decode("utf8") it raises the following
error: UnicodeEncodeError: 'latin-1' codec can't encode character
u'\u2014' in position 0: ordinal not in range(256).
That can happen two ways:
(1) In a single statement like foo = text.decode('utf8'), text is already a unicode object, so Python 2.x first tries to encode it using the default encoding (latin-1 ???).
(2) Possibly in two different statements: first foo = text.decode('utf8'), where text is a str object encoded in UTF-8, and this statement doesn't raise an error; then something like print foo, where your sys.stdout.encoding is latin-1 (???).
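In both scenarios the failing step is an encode to latin-1 that Python 2 performs implicitly. The sketch below writes that encode out explicitly so the same error can be reproduced on Python 2.7 or 3. The helper try_encode is hypothetical, and latin-1 as the encoding involved is only my guess from your error message.

```python
utf8_bytes = b'\xe2\x80\x94'         # a str object in Python 2
foo = utf8_bytes.decode('utf8')      # succeeds; foo is u'\u2014'

def try_encode(value, encoding):
    """Return the UnicodeEncodeError raised by value.encode(encoding), or None.

    try_encode is a hypothetical helper used only for this illustration.
    """
    try:
        value.encode(encoding)
    except UnicodeEncodeError as exc:
        return exc
    return None

# What Python 2 does behind the scenes in either scenario, assuming
# the encoding involved really is latin-1:
error = try_encode(foo, 'latin-1')   # a UnicodeEncodeError instance
no_error = try_encode(foo, 'utf8')   # None: UTF-8 can encode any codepoint
```

The decode itself is not what fails; the error only appears when the already-decoded unicode object is encoded again.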
I can't imagine why you have "ticked" my answer as correct. Nobody knows what the question is yet!
Please edit your question to show your code (insert print repr(text) just before the text.decode("utf8") line), and the result of running it. Show the repr() result and the full traceback (so that we can determine which line is causing the error).
I ask again: can you make your file available for analysis?
By the way, u'\u2014' is an "EM DASH" and is a valid character in cp1252 (but not in latin-1, as you have seen from the error message). What version of what operating system are you using?
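You can confirm the cp1252 versus latin-1 difference with a few lines like these (a sketch; the variable names are illustrative):

```python
import unicodedata

em_dash = u'\u2014'
name = unicodedata.name(em_dash)          # the official Unicode character name

# cp1252 has a slot for the em dash: the single byte 0x97.
cp1252_bytes = em_dash.encode('cp1252')

# latin-1 only covers codepoints 0 to 255, so the same encode fails.
try:
    em_dash.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```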
And to answer your last question, NO, you must NOT attempt to decode your text using every codec in the known universe. You are ALREADY getting plausible Unicode; something (your code?) is decoding something somehow -- the presence of u'\u2014' is enough evidence of that. Just show us your code and its results.