python file reading and splitting the words

Question

I am reading a file in Python and splitting its contents on '\n'. When I print the resulting list, it shows 'Magni\xef\xac\x81cent Mary' instead of 'Magnificent Mary'.

Here is my code...

with open('/home/naveen/Desktop/answer.txt') as ans:
    content = ans.read()
content = content.split('\n')
print content

Note: answer.txt contains the following lines

Magniﬁcent Mary

Flying Sikh

Payyoli Express

Here is my output of the program

Emily Parker · Answer 1 · 2018-04-02T08:20:38.137

0

The problem is in your text file. There is a non-ASCII Unicode character in "Magniﬁcent Mary". If you fix that, your program should work. If you want to read text that contains Unicode characters, you have to decode the file's bytes properly as UTF-8.
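To make the decoding step concrete, here is a minimal sketch that takes the bytes from the question's output and decodes them as UTF-8. The variable names are only for illustration, and the snippet is written so it should run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# The bytes shown in the question's output (assumed to be UTF-8).
raw = b'Magni\xef\xac\x81cent Mary'

# Decoding turns the three bytes \xef\xac\x81 into a single character.
text = raw.decode('utf-8')

print(len(raw), len(text))        # byte count vs. character count
print('U+%04X' % ord(text[5]))    # code point of the character after "Magni"
```

The three escaped bytes are one character once decoded, which is why the byte string is two units longer than the decoded text.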

Have a look at this one (assuming you want to use Python 2): Backporting Python 3 open(encoding="utf-8") to Python 2

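Python 2:

import codecs

# codecs.open decodes the file as UTF-8 while reading, so the items are unicode strings
with codecs.open(filename='/Users/emily/Desktop/answers.txt', mode='rb', encoding='UTF-8') as ans:
  content = ans.read().splitlines()
  # print each item rather than the list, so the text is shown instead of its escaped repr
  for i in content: print i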
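A likely reason the loop looks right while printing the whole list does not: printing a list displays the repr() of each element, and the repr of a byte string escapes non-ASCII bytes. The items themselves are unchanged. A small sketch of that, with made-up variable names and sample bytes taken from the question, written to run on Python 2.7 and 3:

```python
from __future__ import print_function

# Undecoded file contents split into lines (sample bytes from the question).
raw_lines = b'Magni\xef\xac\x81cent Mary\nFlying Sikh'.split(b'\n')

# This is the form a printed list takes: each item is shown via repr().
shown = repr(raw_lines)
print(shown)

# The list item itself still decodes to the expected text.
first = raw_lines[0].decode('utf-8')
print(first == u'Magni\ufb01cent Mary')
```

So the escapes in the printed list are a display detail; indexing into the list still gives the real data.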

If you can use Python 3, you can simply do this:

with open('/home/naveen/Desktop/answer.txt', encoding='UTF-8') as ans:
  content = ans.read().splitlines()
print(content)

edited Apr 02 '18 at 08:20

answered Apr 02 '18 at 06:41

Emily Parker


I have tried io.open and codecs.open, but the problem is not resolved; I still get the same output. In Python 3 it works. Is there any solution for Python 2.7? – Naveen Tummidi Apr 02 '18 at 07:47
I think the problem might be due to printing `content` as a list. Try printing the actual items inside the list; that should work. Instead of `print content`, try `for i in content: print i`. I edited the answer with this addition. – Emily Parker Apr 02 '18 at 08:19
Yes, it works, but I need the items to stay in a list because I use the list items in my program. – Naveen Tummidi Apr 02 '18 at 08:28
The issue only shows up if you use the text exactly as it is printed for the whole list. As long as you don't use the output of `print content`, shouldn't that be fine? – Emily Parker Apr 02 '18 at 09:03
But I need to take each value by its index. It is easy to get each item by index after splitting, which is why I split the string and put the items into a list. – Naveen Tummidi Apr 02 '18 at 09:06
Do you mean you need to access an item in content by index, like `content[0]`? That should still work. – Emily Parker Apr 02 '18 at 09:19
Yes, it works, but suppose I take the first item of the list, which is 'Magnificent Mary', and compare it with 'Magnificent Mary'. I have tried that, but it gives False because the two are not the same when compared. – Naveen Tummidi Apr 02 '18 at 09:22
Yes, they are indeed not the same, because 'Magniﬁcent Mary' from the text file contains the Unicode character **ﬁ** (a single ligature character, not the two letters f and i). You must have copied the text from somewhere and pasted it into the file. If you open the answers.txt file and type 'Magnificent Mary' by hand, the comparison should return True. – Emily Parker Apr 02 '18 at 09:27
Yes, but I want to fix the text programmatically, in Python 2.7 only. Is there any solution? – Naveen Tummidi Apr 02 '18 at 09:28
You can decode a byte string to unicode using `content[0].decode('UTF-8')` – Emily Parker Apr 02 '18 at 09:36
Say I loop over content, call `.decode('UTF-8')` on each item by index, and append the results to a new list. When I print that new list, it shows the same problem. – Naveen Tummidi Apr 02 '18 at 09:44

whiteFang · Answer 2 · 2018-04-02T09:27:34.550

0

There is a problem with the 'f' in Magniﬁcent Mary. It is not the normal f; it is the LATIN SMALL LIGATURE FI. You can simply delete that 'f' and retype it in gedit. To verify the difference, simply include

print [(ord(a), a) for a in content[0]]

at the end of your code and run it once with each version of the f.
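For a more readable comparison, the standard unicodedata module can name each character. This is only a sketch: the two sample strings and the describe helper are hypothetical, standing in for the hand-typed text and the text from the file, and it should run on Python 2.7 and 3.

```python
from __future__ import print_function
import unicodedata

def describe(text):
    """Return (code point, Unicode name) pairs for each character (hypothetical helper)."""
    return [('U+%04X' % ord(ch), unicodedata.name(ch)) for ch in text]

typed = u'Magnificent'       # typed by hand: separate 'f' and 'i'
from_file = u'Magni\ufb01cent'  # as in the file: one ligature character

print(describe(typed[5:7]))
print(describe(from_file[5:6]))
```

The first call reports two ordinary letters, while the second reports a single ligature character, which is why the two strings never compare equal.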

If there is no way to edit the file, you could first convert the string to unicode and then use Python's unicodedata module.

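import unicodedata
with open("answer.txt") as f:
    text = f.read().decode('utf-8')
# NFKD splits the ligature into plain 'f' + 'i'; encoding to ASCII with 'ignore' then drops any remaining non-ASCII characters
print unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').split("\n")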
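If the goal is a list whose items can be indexed and compared against normally typed strings, something along these lines might work. It is a sketch under assumptions: read_normalized_lines is a hypothetical helper, the temporary file only stands in for answer.txt, and it uses NFKC normalization without the ASCII encoding step so that other non-ASCII letters are kept. It should run on Python 2.7 and 3.

```python
from __future__ import print_function
import io
import os
import tempfile
import unicodedata

def read_normalized_lines(path):
    """Read a UTF-8 file and return its lines, NFKC-normalized (hypothetical helper)."""
    with io.open(path, encoding='utf-8') as handle:
        text = handle.read()
    # NFKC replaces compatibility characters such as the fi ligature with plain letters.
    return [unicodedata.normalize('NFKC', line) for line in text.splitlines()]

# Stand-in for answer.txt: a temporary file containing the ligature.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(u'Magni\ufb01cent Mary\nFlying Sikh\nPayyoli Express\n')

lines = read_normalized_lines(path)
os.remove(path)

print(lines[0] == u'Magnificent Mary')
```

Note that NFKC also rewrites other compatibility characters (superscripts, full-width forms and so on), so it is worth checking that this is acceptable for the OCR output before relying on it.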

edited Apr 02 '18 at 09:27

answered Apr 02 '18 at 07:55

whiteFang


OK, it works when done manually, but I want to use the list items in my program. – Naveen Tummidi Apr 02 '18 at 08:29
Is the text file auto-generated by Python code? Post the code if so; otherwise, explain how the text file was created. – whiteFang Apr 02 '18 at 08:42
Yes, it is auto-generated. I am using Tesseract OCR to convert an image to a text file. – Naveen Tummidi Apr 02 '18 at 08:46
