I have this file, testpi.txt, which I'd like to convert into a list of sentences.

    $ cat testpi.txt
    This is math π.
    That is moth pie.

Here's what I've done:

    r = open('testpi.txt', 'r')
    sentence_List = r.readlines()
    print sentence_List  

When the output is redirected to another text file, output.txt, this is what it looks like:

    ['This is math \xcf\x80. That is moth pie.\n']

I tried codecs too, with `r = codecs.open('testpi.txt', 'r', encoding='utf-8')`,
but then every entry in the output has a leading `u`.

How can I display the byte string \xcf\x80 as π in output.txt?

Please guide me, thanks.

abT

1 Answer

The problem is that you're printing the entire list. Printing a list shows the repr() of each element, which escapes non-ASCII bytes as \xcf\x80. Instead, print each string individually and it will work:

    r = open('t.txt', 'r')
    sentence_List = r.readlines()
    for line in sentence_List:
        # The trailing comma stops print from adding a second newline (Python 2)
        print line,

Or:

    print "['{}']".format("', '".join(map(str.rstrip, sentence_List)))
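
For readers on Python 3, the same join-based idea might be sketched as below. This is only an illustration, not the answerer's code: `format_like_list` is a hypothetical helper name, and the sample lines are copied from the question. The last line shows where the `\xcf\x80` escapes come from, since calling `repr()` on UTF-8 bytes produces them.

```python
def format_like_list(lines):
    """Build a list-looking string without calling repr() on each item."""
    stripped = [line.rstrip() for line in lines]
    return "['{}']".format("', '".join(stripped))


# Sample data taken from the question (\u03c0 is the character π)
lines = ["This is math \u03c0.\n", "That is moth pie.\n"]
formatted = format_like_list(lines)

# repr() of the UTF-8 bytes for π shows the escapes seen in output.txt
escaped = repr("\u03c0".encode("utf-8"))
```

Here `formatted` holds the text `['This is math π.', 'That is moth pie.']`, which is a display string rather than a real list. For further processing, keep working with `lines` itself.
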
John Zwinck
  • Thanks a lot for answering. However, I need the output as a list of sentences that will be processed further. I went ahead with the 'output.txt' mentioned earlier, but when calculating word frequencies, '\xcf\x80' is displayed instead of π, which is not what I want. Could you please suggest something? – abT May 11 '16 at 09:11
  • Can I get a list in a text file, something like this: ['This is math π. That is moth pie.\n']? – abT May 11 '16 at 09:15
  • @abT: I edited the last example code in my answer to do what I think you want: print it "like a list" but with the fancy characters preserved. – John Zwinck May 11 '16 at 09:15
  • Thanks John, that was all I wanted. I got output list -> ['This is math π.', 'That is moth pie.'] – abT May 11 '16 at 09:21
  • @abT: if you don't use `io.open()`, you may get mojibake if the encoding of `testpi.txt` differs from the console encoding. Related: [Removing u in list](http://stackoverflow.com/a/33423708/4279) – jfs May 11 '16 at 22:02
  • @J.F. Sebastian, thanks! I read `Removing u in list`, and from there I read your `Print unicode directly` post. I added `PYTHONIOENCODING= "utf-8"` in one of my py files and I no longer have to use any `.decode('utf-8')` or `.encode('utf-8')`. But I still need to use `print (i.get('Body')).encode('utf-8')` in one py file where I am extracting the contents of the `Body` tag from an xml and printing them to a text file. Any suggestions as to why `PYTHONIOENCODING= "utf-8"` isn't working there? Should I edit my question and put everything there along with the code? – abT May 14 '16 at 18:09
  • @abT it should be asked as a separate question. Create a minimal code example that shows your issue. "Isn't working" is not very informative: show how you run the script. Describe in words what you expect to happen and what happens instead (step by step). Provide example input/output and error messages, if any. Mention your OS, Python versions and relevant environment variables such as LC_ALL, LC_CTYPE, LANG. – jfs May 14 '16 at 18:28
  • @J.F. Sebastian : Sure, I've asked a new question with all the relevant details posted there. [http://stackoverflow.com/questions/37230865/unicodeencodeerror-ascii-codec-cant-encode-character-u-u03c0] – abT May 14 '16 at 19:19
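
The `io.open()` suggestion from the comments can be sketched roughly as follows. This is a minimal example under stated assumptions, not code from the thread: it assumes the file is UTF-8 encoded, and `read_sentences`, `write_sentences`, and the temporary directory are hypothetical names used only for illustration. Passing the encoding explicitly means the result does not depend on the console encoding.

```python
import io
import os
import tempfile


def read_sentences(path, encoding="utf-8"):
    """Return the lines of a text file as decoded text, without line endings."""
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.rstrip("\r\n") for line in handle]


def write_sentences(path, sentences, encoding="utf-8"):
    """Write one sentence per line, encoding explicitly on the way out."""
    with io.open(path, "w", encoding=encoding) as handle:
        for sentence in sentences:
            handle.write(sentence + u"\n")


# Hypothetical round trip in a temporary directory (file names from the question)
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "testpi.txt")
target = os.path.join(workdir, "output.txt")

write_sentences(source, [u"This is math \u03c0.", u"That is moth pie."])
sentences = read_sentences(source)   # a real list, ready for further processing
write_sentences(target, sentences)   # π is stored as the bytes \xcf\x80 on disk
```

With this approach `sentences` stays a list of text strings, so a later step such as a word-frequency count sees π as a single character rather than as escaped bytes.
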