I've honestly spent a lot of time on this, and it's slowly killing me. I've stripped content from a PDF and stored it in an array. Now I'm trying to pull it back out of the array and write it into a txt file. However, I do not seem to be able to make it happen because of encoding issues.

allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    print kmlDescription  # this prints out fine
    outputFile.write(kmlDescription)

The error I'm getting is "UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 213: ordinal not in range(128)".

I'm just messing around now, but I've tried all kinds of ways to get this stuff to write out.

outputFile.write(kmlDescription).decode('utf-8')          

Please forgive me if this is basic, I'm still learning Python (2.7).

Cheers!

EDIT1: Sample data looks something like the following:

Chart 3686 (plan, Morehead City) [ previous update 4997/11 ] NAD83 DATUM
Insert the accompanying block, showing amendments to coastline,
depths and dolphins, centred on: 34° 41´·19N., 76° 40´·43W.
Delete R 34° 43´·16N., 76° 41´·64W.

When I add the print type(raw), I get

Edit 2: When I just try to write the data, I receive the original error message (ascii codec can't decode byte...)

I will check out the suggested thread and video. Thanks folks!

Edit 3: I'm using Python 2.7

Edit 4: agf hit the nail on the head in the comments below when (s)he noticed that I was double encoding. I tried intentionally double encoding a string that had previously been working and produced the same error message that was originally thrown. Something like:

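text = u"Here's a string, but imagine it has some weird symbols and whatnot in it - apparently latin-1"  # a unicode object
textEncoded = text.encode('utf-8')           # unicode -> UTF-8 bytes (a Python 2 str)
textEncodedX2 = textEncoded.encode('utf-8')  # encoding bytes again: Python 2 first decodes them as ASCII, which raises on non-ASCII bytes
outputfile.write(textEncoded)    # works
outputfile.write(textEncodedX2)  # fails (not reached if the second encode already raised)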
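
As a rough sketch of what the second encode does under the hood (assuming the explanation from the comments: Python 2 implicitly decodes a byte string as ASCII before encoding it), the hidden step can be written out explicitly. The sample string is hypothetical, modelled on the chart data above, and the explicit `.decode("ascii")` stands in for the implicit conversion so that the snippet behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text containing a degree sign (U+00B0).
text = u"34\u00b0 41N"

# First encode: unicode -> UTF-8 bytes. The degree sign becomes 0xc2 0xb0.
encoded = text.encode("utf-8")

# Python 2's str.encode() on a byte string behaves roughly like
# encoded.decode("ascii").encode("utf-8"); the decode step is what fails.
try:
    encoded.decode("ascii")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

print(failed)
print(message)
```

The error names the `ascii` codec and byte 0xc2 even though the code only ever asked for UTF-8, which matches the shape of the original error message.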

Once I figured out I was trying to double encode, the solution was the following:

allTheNTMs.append(contentRaw[s1:].encode("utf-8"))
for a in range(len(allTheNTMs)):
    kmlDescription = allTheNTMs[a]
    kmlDescriptionDecode = kmlDescription.decode("latin-1")
    outputFile.write(kmlDescriptionDecode)
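
One thing that may be worth checking with this workaround, sketched here with a hypothetical sample string rather than the real PDF content: decoding UTF-8 bytes as latin-1 never raises, because every byte maps to some character, but each two-byte sequence comes back as two characters instead of one.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample, modelled on the coordinates in the question.
original = u"34\u00b0 41\u00b4\u00b719N"
utf8_bytes = original.encode("utf-8")

# Decoding as latin-1 succeeds for any byte sequence, but 0xc2 turns into
# its own character (U+00C2) in front of every non-ASCII symbol.
as_latin1 = utf8_bytes.decode("latin-1")

# Decoding with the codec that was used to encode gives the text back.
as_utf8 = utf8_bytes.decode("utf-8")

print(repr(as_latin1))
print(as_utf8 == original)
```

If the output file shows a stray "Â" before the degree signs, decoding with "utf-8" instead, or not encoding in the first place, is probably the cleaner route.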

It's working now, and I sure appreciate all of your help!!

gruvn
  • Please provide some sample data that you are having the problem with, then run "type(raw_data)" and paste the result into your question. – pylover Mar 27 '12 at 19:45
  • What happens if you just try to `write` the `contentRaw`? It looks to me like the data is already encoded. – agf Mar 27 '12 at 19:51
  • I solved some identical problems using the `codecs` module, specifically `codecs.open()` and `codecs.write()`. Might be worth taking a look. – heltonbiker Mar 27 '12 at 19:59
  • you may want to have a look at this post: http://stackoverflow.com/a/448383/1025391 – moooeeeep Mar 27 '12 at 20:02
  • What is outputFile? Also, can you create a self-contained example, including data, that throws the error? – Anurag Uniyal Mar 27 '12 at 20:10
  • `contentRaw[s1:]` is not of type `unicode`. When you call `.encode` on a bytes object, Python2 implicitly decodes `str` type (which contains a sequence of bytes) to type `unicode` using the ascii codec, then encodes the unicode to your supplied encoding. See [this pycon video](http://pyvideo.org/video/948/pragmatic-unicode-or-how-do-i-stop-the-pain) – Daenyth Mar 27 '12 at 20:26
  • Thanks for the sample data, but what is your outputFile's type, and what is your environment? On my Ubuntu 11.10 it works fine with your sample data. Please provide the repr(kmlDescription) output instead of print(kmlDescription). – pylover Mar 28 '12 at 21:00

2 Answers

My guess is that the output file was opened with a latin1 or even a utf-8 codec, so you cannot write UTF-8-encoded data to it, because the file object tries to convert the data again. To a normally opened file you can write any arbitrary byte string. Here is an example that recreates a similar error:

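import codecs

u = u'सच्चिदानन्द हीरानन्द वात्स्यायन '
s = u.encode('utf-8')  # s is now a byte string (str), no longer unicode
f = codecs.open('del.text', 'wb', encoding='latin1')
f.write(s)  # the codec writer expects unicode, so Python 2 first decodes s as ASCII and fails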

output:

Traceback (most recent call last):
  File "/usr/lib/wingide4.1/src/debug/tserver/_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

Solution:

This will work if you don't set any codec:

f = open('del.txt', 'wb') 
f.write(s)

The other option is to write the unicode strings to the file directly, without encoding them yourself, if outputFile has been opened with the correct codec, e.g.:

f = codecs.open('del.text', 'wb',encoding='utf-8')
f.write(u)
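
A minimal sketch of that second option, under some assumptions: it swaps in `io.open` (available in Python 2.6+ and Python 3) for `codecs.open`, and the sample text and temporary output path are hypothetical. The idea is the same: keep the text as unicode and let the file object do the single encode.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample line, modelled on the data in the question.
description = u"centred on: 34\u00b0 41\u00b4\u00b719N., 76\u00b0 40\u00b4\u00b743W."

# Hypothetical output path inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "del.txt")

# Text mode with an explicit encoding: write unicode, never pre-encoded bytes.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(description)

# Read the raw bytes back to confirm the file holds the UTF-8 form of the text.
with io.open(path, "rb") as f:
    raw = f.read()

round_trip_ok = raw.decode("utf-8") == description
print(round_trip_ok)
```

With this pattern no `.encode()` call is needed anywhere before the write, so there is nothing to double encode.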
Anurag Uniyal
  • Thanks - I tried this only to get the error "typeerror:encoding is an invalid keyword argument for this function)". Looks like opening with an encoding started in Python 3, and I'm using 2.7. I should have specified that, and will edit my question. – gruvn Mar 28 '12 at 19:28
  • @gruvn I am using Python 2.7. Which function are you passing encoding to? Use codecs.open. – Anurag Uniyal Mar 28 '12 at 20:10
  • Oh crap - Sorry Anurag - I was trying: f=open('del.text','wb',encoding='utf-8') instead of f=codecs.open('del.text','wb',encoding='utf-8') I'll have another look. PS: Sorry for the formatting, I can't get the code tags to work! – gruvn Mar 29 '12 at 11:20
  • Hmm, still no luck. when I try `outputFile = codecs.open(outputFileName, "wb",encoding='utf-8')`, I just get the message - "NameError: global name 'codecs' is not defined" – gruvn Mar 30 '12 at 11:23
  • @gruvn Did you import the codecs module? You need to import any module before using it, e.g. `import codecs`. I recommend you go through the Python tutorial first. – Anurag Uniyal Mar 30 '12 at 14:37

Your error message doesn't appear to relate to any of your Python syntax, but to the fact that you're trying to decode a hex value which has no equivalent in UTF-8.

HEX 0xc2 appears to represent a Latin character: an uppercase A with an accent on top. Therefore, instead of using "allTheNTMs.append(contentRaw[s1:].encode("utf-8"))", try:

allTheNTMs.append(contentRaw[s1:].encode("latin-1"))

I'm not an expert in Python, so this may not work, but it would appear you're trying to encode a Latin character. Given the error message you are receiving, it would also appear that when encoding to UTF-8, Python only looks through the first 128 entries: your error indicates that entry "0xc2" is out of range, and it is indeed outside the first 128 entries of UTF-8.

thefragileomen
  • UTF-8 can represent any unicode code point, so the problem is that he is trying to double encode the data, not what the target encoding is. – agf Mar 27 '12 at 20:04
  • This is incorrect. He's calling the **`.encode`** method and getting a Unicode **Decode** Error. That means there's python2's implicit str/unicode conversion going on. – Daenyth Mar 27 '12 at 20:29
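
The two comments above can be checked with a short sketch. The sample characters are illustrative choices, not taken from the asker's data beyond the degree sign: encoding to UTF-8 succeeds for all of them, while latin-1 only covers U+0000 to U+00FF, so switching the target encoding to latin-1 would not help in general.

```python
# -*- coding: utf-8 -*-
# Degree sign, Devanagari letter SA, euro sign (illustrative samples).
samples = [u"\u00b0", u"\u0938", u"\u20ac"]

# UTF-8 has a byte sequence for every one of these code points.
utf8_bytes = [s.encode("utf-8") for s in samples]

# latin-1 cannot represent code points above U+00FF.
latin1_failures = []
for s in samples:
    try:
        s.encode("latin-1")
    except UnicodeEncodeError:
        latin1_failures.append(s)

print(utf8_bytes)
print(len(latin1_failures))
```

Note that these failures are UnicodeEncodeError, whereas the question reports a UnicodeDecodeError from an `.encode` call, which is the sign of the implicit str-to-unicode conversion described above.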