readlines() function and unicodes

Question

I have this file, testpi.txt, which I'd like to convert into a list of sentences.

    $ cat testpi.txt
    This is math π.
    That is moth pie.

Here's what I've done:

    r = open('testpi.txt', 'r')
    sentence_List = r.readlines()
    print sentence_List

When the output is sent to another text file, output.txt, it looks like this:

    ['This is math \xcf\x80. That is moth pie.\n']

I also tried codecs, with `r = codecs.open('testpi.txt', 'r', encoding='utf-8')`, but then every entry in the output has a leading 'u'.

How can I display the byte string \xcf\x80 as π in output.txt?

Please guide me, thanks.

John Zwinck · Accepted Answer · 2016-05-11T09:15:31.963

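The problem is that you're printing the entire list. In Python 2, printing a list shows the repr() of each element, which escapes non-ASCII bytes as \xcf\x80 (and adds the leading 'u' for unicode strings). Instead, print each string individually and it will work: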

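    # Assumes testpi.txt is the file from the question.
    with open('testpi.txt', 'r') as r:
        sentence_List = r.readlines()
    for line in sentence_List:
        print line,  # trailing comma: each line already ends with '\n'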

Or, to print it "like a list" while keeping the characters intact:

    print "['{}']".format("', '".join(map(str.rstrip, sentence_List)))
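The same idea can be sketched in a way that also writes the result to output.txt. This is only a minimal sketch, assuming testpi.txt is UTF-8 encoded; it uses `io.open()` so the text is decoded on reading, and it runs on both Python 2 and Python 3. The temporary directory and the variable names (`in_path`, `out_path`, `formatted`, `raw_pi`) are hypothetical and exist only to make the example self-contained.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile

# Hypothetical stand-in for testpi.txt, created in a temporary directory.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, 'testpi.txt')
out_path = os.path.join(workdir, 'output.txt')

with io.open(in_path, 'w', encoding='utf-8') as f:
    f.write('This is math π.\nThat is moth pie.\n')

# Read decoded text instead of raw bytes (assumes the file is UTF-8).
with io.open(in_path, 'r', encoding='utf-8') as f:
    sentences = [line.rstrip() for line in f]

# Build the list-like line by hand, so repr() never gets a chance
# to escape the non-ASCII characters.
formatted = "['{}']".format("', '".join(sentences))

# Write the result back out as UTF-8.
with io.open(out_path, 'w', encoding='utf-8') as f:
    f.write(formatted + '\n')

# For reference: these two bytes are the UTF-8 encoding of π,
# which is what showed up as \xcf\x80 in the question.
raw_pi = 'π'.encode('utf-8')
```

Here `formatted` holds the text ['This is math π.', 'That is moth pie.'], and `sentences` remains an ordinary list that can be processed further, for example for word frequencies.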

edited May 11 '16 at 09:15

answered May 11 '16 at 08:44

Thanks a lot for answering. However, I need the output as a list of sentences that will be processed further. I went ahead with the 'output.txt' mentioned earlier, but when calculating the word frequencies, '\xcf\x80' is displayed instead of π, which is not what I want. Could you please suggest something? – abT May 11 '16 at 09:11
Can I get the list in a text file, something like this -> ['This is math π. That is moth pie.\n']? – abT May 11 '16 at 09:15
@abT: I edited the last example code in my answer to do what I think you want: print it "like a list" but with the fancy characters preserved. – John Zwinck May 11 '16 at 09:15
Thanks John, that was all I wanted. I got output list -> ['This is math π.', 'That is moth pie.'] – abT May 11 '16 at 09:21
@abT: if you don't use `io.open()`, you may get mojibake when the encoding of `testpi.txt` differs from the console encoding. Related: [Removing u in list](http://stackoverflow.com/a/33423708/4279) – jfs May 11 '16 at 22:02
@J.F. Sebastian, Thanks! I read `Removing u in list` and from there I read your `Print unicode directly` post. I added `PYTHONIOENCODING= "utf-8"` in one of my py files and I no longer have to use any `.decode('utf-8')` or `.encode('utf-8')`. But I still need to use `print (i.get('Body')).encode('utf-8')` in one py file where I am extracting the contents of the `Body` tag from an xml and printing them to a text file. Any suggestions why `PYTHONIOENCODING= "utf-8"` isn't working there? Should I edit my question and put everything there along with the code? – abT May 14 '16 at 18:09
@abT it should be asked as a separate question. Create a minimal code example that shows your issue. "Isn't working" is not very informative: show how you run the script. Describe in words what you expect to happen. What happens instead (step by step)? Provide example input/output and error messages, if any. Mention your OS, Python versions and relevant environment variables such as LC_ALL, LC_CTYPE, LANG. – jfs May 14 '16 at 18:28
@J.F. Sebastian : Sure, I've asked a new question with all the relevant details posted there. [http://stackoverflow.com/questions/37230865/unicodeencodeerror-ascii-codec-cant-encode-character-u-u03c0] – abT May 14 '16 at 19:19
