
I'm trying to get rid of diacritics in my text file. I converted a PDF to text with a tool that I didn't write myself, and I wasn't able to find out which encoding it uses. The text is written in Nahuatl, whose orthography is similar to Spanish.

I transformed the text into a list of strings. Now I'm trying to do the following:

import string

# check whether there is a non-ASCII character in the item
def is_ascii(word):
    check = string.ascii_letters + "."
    if word not in check:
        return False
    return True

# if there is a non-ASCII character, encode the string
def to_ascii(word):
    if is_ascii(word) == False:
        newWord = word.encode("utf8")
        return newWord
    return word

What I want to get is a Unicode version of my string. It doesn't work so far, and I have tried several encodings like latin1, cp1252, and iso-8859-1. Can anybody tell me what I did wrong?

How can I find out the right encoding?

Thank you!

EDIT: I wrote to the people that developed the converter (PDF-to-text) and they said they were using Unicode already. So John Machin was right with (1) in his answer. As I wrote in some comment, that wasn't clear to me, because in the Eclipse debugger the list itself showed some characters as Unicode escapes and others not. And if I looked at the items separately they were all decoded in some way, so that I actually saw Unicode.

Thank you for your help!

Rattlesnake
    See [Pragmatic Unicode, or, How do I stop the pain?](http://nedbatchelder.com/text/unipain.html) by Ned Batchelder. Fact of Life #4 ("You cannot infer the encoding of bytes; You must be told, or you have to guess") is directly relevant but it seems you could use the rest as well. –  Feb 22 '13 at 19:30
  • Note that encoding-guessing libraries that use statistical info about character frequencies and combinations are unlikely to work as well for Nahuatl as they would for English. The unfortunate fact is that most of the effort toward guessing encodings has been focused on documents whose text is in one of the handful of major world languages. – BrenBarn Feb 22 '13 at 19:35
  • @user1986412: can you make your file available for analysis? – John Machin Feb 22 '13 at 23:36

2 Answers


If you have read some bytes and want to interpret them as a Unicode string, then you have to use .decode() rather than .encode().

Like @delnan said in the comments, I hope you know the encoding. If not, the guesswork should get easier once you fix the function.

BTW, even if there are only ASCII characters in that word, why not .decode() it too? You'd have the same data type (unicode) everywhere, which will make your program simpler.
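A minimal sketch of both points, assuming Python 2.x and UTF-8 input (the sample words are just illustrations):

```
raw = 'est\xc3\xa1n'           # str object: UTF-8 bytes as read from the file
word = raw.decode('utf8')      # bytes -> unicode: u'est\xe1n'

plain = 'atl'.decode('utf8')   # a pure-ASCII str decodes fine too,
                               # since ASCII is a subset of UTF-8

# The wrong direction: calling .encode() on a str makes Python 2 first
# decode it with the ASCII codec, which blows up on the \xc3 byte.
# raw.encode('utf8')           # raises UnicodeDecodeError
```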

Kos
  • What makes me so confused is that I split my text into words and store them in a list. And inside this list there seems to be UTF-8. Example: a word is stored as: est\\xc3\\xa1n But when I iterate through the list and work with the word as an item it is displayed like this: "str: están" – Rattlesnake Feb 22 '13 at 19:40
  • And why does that surprise you? There's nothing wrong with keeping UTF-8 strings in `str` variables as long as you're consistent with processing it. Using the built-in `unicode` type makes processing easier, though. – Kos Feb 22 '13 at 20:16

Edit your question to show the version of Python you are using. Guessing the version from your code is not possible. Whether you are using Python 3.X or 2.X matters a lot. Following remarks assume Python 2.x.

You already seem to have determined that you have UTF-8 encoded text. Try `the_text.decode('utf8')`. Note decode, NOT encode.

If decoding with UTF-8 does not raise UnicodeDecodeError and your text is not trivially short, then it is very close to certain that UTF-8 is the correct encoding.
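For instance, a quick way to run that test (a sketch; the file name is hypothetical):

```
with open('nahuatl.txt', 'rb') as f:   # hypothetical file name
    raw = f.read()

try:
    text = raw.decode('utf8')
except UnicodeDecodeError:
    print 'not valid UTF-8 -- keep looking'
else:
    print 'almost certainly UTF-8: %d characters decoded' % len(text)
```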

If the above does not work, show us the result of `print repr(the_text)`.

Note that it is counter-productive trying to check whether the file is encoded in ASCII -- ASCII is a subset of UTF-8. Leaving some data as str objects and others as unicode is messy in Python 2.x and won't work at all in Python 3.x.

In any case, your first function doesn't do what you think it does: `word not in check` tests whether the whole word is a substring of `check`, not whether every character is allowed, so it returns False for almost any input string whose length is 2 or more. Please consider unit-testing functions as you write them; it makes debugging much faster later on.
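For what it's worth, here is one per-character rewrite with a couple of tests attached (a sketch, not the only possible fix):

```
import string

ALLOWED = set(string.ascii_letters + '.')

def is_ascii(word):
    # test each character individually, instead of asking whether
    # the whole word is a substring of the allowed characters
    return all(c in ALLOWED for c in word)

assert is_ascii('cuicatl.')
assert not is_ascii('est\xc3\xa1n')   # contains non-ASCII bytes
```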

Note that latin1 and iso-8859-1 are the same encoding. Because latin1 maps each of the 256 possible byte values straight onto the first 256 Unicode codepoints, it is impossible for text.decode('latin1') to raise UnicodeDecodeError. "No error" in this case has exactly zero diagnostic value.
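You can demonstrate that with any bytes at all (a sketch):

```
import os

junk = os.urandom(16)          # 16 arbitrary bytes
text = junk.decode('latin1')   # never raises: each byte maps to U+0000..U+00FF
assert len(text) == 16
```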

Update in response to this comment from OP:

I use Python 2.7. If I use text.decode("utf8") it raises the following error: UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 0: ordinal not in range(256).

That can happen two ways:

(1) In a single statement like foo = text.decode('utf8'), text is already a unicode object so Python 2.X tries to encode it using the default encoding (latin-1 ???).

(2) Possibly in two different statements, first foo = text.decode('utf8') where text is an str object encoded in UTF-8, and this statement doesn't raise an error, followed by something like print foo and your sys.stdout.encoding is latin-1 (???).
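Case (1) is easy to reproduce in Python 2 (a sketch; which codec appears in the error message depends on your default encoding, normally ascii):

```
text = u'\u2014'     # already a unicode object (an EM DASH)
text.decode('utf8')  # Python 2 first encodes with the default codec, so this
                     # raises UnicodeEncodeError, not UnicodeDecodeError
```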

I can't imagine why you have "ticked" my answer as correct. Nobody knows what the question is yet!

Please edit your question to show your code (insert print repr(text) just before the text.decode("utf8") line), and the result of running it. Show the repr() result and the full traceback (so that we can determine which line is causing the error).

I ask again: can you make your file available for analysis?

By the way, u'\u2014' is an "EM DASH" and is a valid character in cp1252 (but not in latin-1, as you have seen from the error message). What version of what operating system are you using?
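You can check that in an interactive session (a sketch):

```
u'\u2014'.encode('cp1252')    # works: gives '\x97'
u'\u2014'.encode('latin-1')   # raises UnicodeEncodeError: ordinal not in range(256)
```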

And to answer your last question, NO, you must NOT attempt to decode your text using every codec in the known universe. You are ALREADY getting plausible Unicode; something (your code?) is decoding something somehow -- the presence of u'\u2014' is enough evidence of that. Just show us your code and its results.

John Machin
  • No, unfortunately I haven't figured out the encoding yet. I use Python 2.7. If I use text.decode("utf8") it raises the following error: UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2014' in position 0: ordinal not in range(256). Ok, so it seems not to be utf8 - must I now randomly check every possible encoding? Like those found in this list: http://docs.python.org/2/library/codecs.html – Rattlesnake Feb 24 '13 at 19:32