Objective: I want to read the text from a Word file, increment the ASCII value of each character by a predefined number (a sort of encoding), and save the result back into the same file. For example, 'A' has an ASCII value of 65, so I need it to become 75. I have written the following code and am stuck at this point.

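from docx import Document

# Raw string so the backslashes in the Windows path are taken literally.
data = Document(r"C:\Python27\Testing.docx")
for n in data.paragraphs:
    temp = n.text  # a unicode object in Python 2
    try:
        # str() implicitly encodes to ASCII and fails on characters such as u'\u2019'.
        temp1 = str(temp)
    except UnicodeEncodeError:
        # 'replace' substitutes '?' for every character that has no ASCII equivalent.
        temp1 = temp.encode('ascii', 'replace')
    print temp1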

The output I get is:

This is just a test of what I?m gonna make. Fingers crossed?

whereas the original string is:

This is just a test of what I’m gonna make. Fingers crossed…

How can I replace the Unicode characters with the corresponding ASCII characters so that I can proceed? Any suggestions would be appreciated.
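For illustration, here is a minimal sketch of one way such a replacement could be done with the standard library only. The names `PUNCTUATION_TO_ASCII` and `to_ascii` are hypothetical, and the mapping covers just a few common typographic characters; it is an assumption about what "corresponding ASCII characters" should mean, not a complete transliteration table.

```python
# -*- coding: utf-8 -*-
import unicodedata

# Hypothetical mapping (an assumption): extend it for the characters
# that actually occur in the document.
PUNCTUATION_TO_ASCII = {
    u"\u2018": u"'",    # left single quotation mark
    u"\u2019": u"'",    # right single quotation mark
    u"\u201c": u'"',    # left double quotation mark
    u"\u201d": u'"',    # right double quotation mark
    u"\u2013": u"-",    # en dash
    u"\u2014": u"-",    # em dash
    u"\u2026": u"...",  # horizontal ellipsis
}


def to_ascii(text):
    """Return a copy of `text` that contains only ASCII characters."""
    # Replace known typographic punctuation with plain equivalents.
    for fancy, plain in PUNCTUATION_TO_ASCII.items():
        text = text.replace(fancy, plain)
    # NFKD splits accented letters into a base letter plus combining marks.
    text = unicodedata.normalize("NFKD", text)
    # Drop whatever is still outside ASCII (combining marks, unmapped characters).
    return text.encode("ascii", "ignore").decode("ascii")


original = u"This is just a test of what I\u2019m gonna make. Fingers crossed\u2026"
print(to_ascii(original))
```

Note that this conversion is lossy: characters with no entry in the mapping and no ASCII base letter are silently dropped, so the original text cannot be recovered from the result.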

NISHIT KHARA
  • What do you mean by replacing Unicode with Ascii characters? These two are completely different ... – linusg Nov 25 '16 at 12:35
  • When I cast it to a string, the Unicode characters are not converted and I get a UnicodeEncodeError. So I want to convert those characters into string characters as well. – NISHIT KHARA Nov 25 '16 at 12:38
  • You can't have unicode chars in an ascii string that are not in the ascii codec! (or I do not understand what you want to achieve...) – linusg Nov 25 '16 at 12:42
  • The temp variable in the code is of the Unicode datatype. I want to convert it to the string datatype. When I typecast it using `str(temp)`, some characters, such as the single quote in my example, are not cast. Is there any way to cast such characters? – NISHIT KHARA Nov 25 '16 at 12:47
  • what do you get if you do a temp1 = temp.decode() in the exception block? – themistoklik Nov 25 '16 at 12:48
  • I commented out the encode statement and used `temp1 = temp.decode()` instead: `Traceback (most recent call last): File "C:\Python27\Testing_1.py", line 16, in <module> temp1 = temp.decode() UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 29: ordinal not in range(128)` – NISHIT KHARA Nov 25 '16 at 12:50
  • http://stackoverflow.com/a/9942822/3025412 does this help? – themistoklik Nov 25 '16 at 12:50
  • or encode them in utf-8 instead of ascii – themistoklik Nov 25 '16 at 12:51
  • I tried the utf-8 instead of ascii. I got the following result `This is just a test of what I’m gonna make. Fingers crossed… ` – NISHIT KHARA Nov 25 '16 at 12:54
  • Is that result in the terminal or in a file? You may not have set the sys encoding to utf-8. – themistoklik Nov 25 '16 at 13:01
  • My default encoding is ASCII; I checked that using `sys.getdefaultencoding()`. I have read that we should not play with `setdefaultencoding()`, and it has also been removed from `sys`. – NISHIT KHARA Nov 25 '16 at 13:48
  • A docx fundamentally has Unicode text. You should revisit/refine your requirements. A Caesar cipher is best defined to transform only certain alphabets and pass every other character through unchanged, for example the [Basic Latin](http://www.unicode.org/charts/nameslist/index.html) uppercase and lowercase letters. (I don't know Python, but maybe you should explore whether Python 3 has better support for Unicode.) – Tom Blodget Nov 25 '16 at 16:33
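
The last comment's suggestion can be sketched roughly as follows: shift only the Basic Latin letters and pass every other character, including the curly apostrophe and the ellipsis, through unchanged. The function name `shift_letters` is hypothetical. Wrapping around within the alphabet is also an assumption, since the question only says that 'A' (65) should become 75, which is 'K'.

```python
def shift_letters(text, shift):
    """Shift A-Z and a-z by `shift` places, wrapping around the alphabet.

    Every other character is returned unchanged, so no conversion of
    the text to ASCII is needed beforehand.
    """
    result = []
    for ch in text:
        if u"A" <= ch <= u"Z":
            base = ord(u"A")
        elif u"a" <= ch <= u"z":
            base = ord(u"a")
        else:
            # Not a Basic Latin letter: pass it through untouched.
            result.append(ch)
            continue
        # Move within the 26-letter alphabet and wrap with modulo.
        result.append(chr((ord(ch) - base + shift) % 26 + base))
    return u"".join(result)


encoded = shift_letters(u"I\u2019m gonna make", 10)
decoded = shift_letters(encoded, -10)  # the negative shift undoes the encoding
```

Because the text stays Unicode throughout, the `str()` call that raises UnicodeEncodeError is never needed. Assuming python-docx, the shifted text could then be assigned back to each run's `text` attribute and the document saved with `data.save(...)`; that step is not shown here.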

0 Answers