I have a text file containing about 5,400,000 lines of Hindi text. I want to read these lines into a list of strings in Python. I tried this code:

    f = open("cleanHindi_Translated.txt" , "r")
    array = []
    for line in f:
        array.append(line)

    print(array)

But I am getting an error:

    Traceback (most recent call last):
      File "hindi.py", line 11, in <module>
        for line in f:
      File "C:\Users\Preeti\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 124: character maps to <undefined>

I don't understand what I did wrong here.

  • I haven't tried something like this before, but I guess it should be `.append(line)`. – Wai Kiat Jun 30 '19 at 11:51
  • I tried to include `encoding="utf8"`, but I was not able to include the read mode `"r"` in that case. So I do not think it is a duplicate of that question, as the solutions given there have not worked for me. – Praveen Iyer Jun 30 '19 at 11:56
  • Have you tried the approach in https://stackoverflow.com/questions/3277503/how-to-read-a-file-line-by-line-into-a-list? – Wai Kiat Jun 30 '19 at 12:00
  • Yes I did try that but ended up getting a similar error. – Praveen Iyer Jun 30 '19 at 12:10
  • Edit question to show your additional attempts. `open` definitely takes an encoding and your error message shows that the encoding is wrong. – Mark Tolonen Jun 30 '19 at 12:14
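
The traceback goes through `cp1252.py`, which suggests the file was decoded with the Windows default encoding rather than UTF-8. Here is a minimal sketch of how byte 0x8d can arise, assuming the file is UTF-8 encoded. The sample word is an illustration and is not taken from the asker's file.

```python
# Assumption: the file is UTF-8. The sample word below is illustrative only.
sample = "\u0939\u093f\u0928\u094d\u0926\u0940"  # the word "Hindi" in Devanagari

raw = sample.encode("utf-8")

# The virama (U+094D) is encoded as the bytes E0 A5 8D, so 0x8d occurs in the data.
has_0x8d = b"\x8d" in raw

try:
    # cp1252 has no character assigned to 0x8d, so this decode fails.
    raw.decode("cp1252")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding the same bytes as UTF-8 recovers the original text.
round_trip = raw.decode("utf-8")
```

If this matches the situation, passing `encoding="utf-8"` to `open` alongside the mode should avoid the error, since `encoding` is a keyword argument and does not conflict with `"r"`.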

1 Answer

`lines` is the list (array) you are looking for:

    import io

    # In Python 3, io.open is the same function as the built-in open
    with io.open('my_file.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()

balderman
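
A variant of the same idea using the built-in `open`, sketched under the assumption that the file is UTF-8. The helper name `read_lines` and the demo file are hypothetical. Unlike `readlines()`, this version strips the trailing newline from each line.

```python
import os
import tempfile


def read_lines(path, encoding="utf-8"):
    """Return the lines of a text file as a list of strings, without trailing newlines."""
    # The mode and the encoding can be passed together; encoding is a keyword argument.
    with open(path, "r", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Demo on a throwaway file (hypothetical content, not the asker's data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("\u0928\u092e\u0938\u094d\u0924\u0947\n\u0939\u093f\u0928\u094d\u0926\u0940\n")
    lines = read_lines(demo_path)
```

For a file with several million lines, the whole list is held in memory. If the lines can be processed one at a time, iterating over the file object directly avoids building the list.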
  • I am still getting an error: `Traceback (most recent call last): File "hindi.py", line 9, in <module> lines = f.readlines() File "C:\Users\Preeti\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode return codecs.charmap_decode(input,self.errors,decoding_table)[0] UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 124: character maps to <undefined>` – Praveen Iyer Jun 30 '19 at 12:07
  • @PraveenIyer I have updated the code. – balderman Jun 30 '19 at 12:10
  • This code seems to work, but when I tried `print(lines)` I got output with question marks in place of the Hindi text. – Praveen Iyer Jun 30 '19 at 12:18
  • @PraveenIyer see https://stackoverflow.com/questions/5203105/printing-a-utf-8-encoded-string – balderman Jun 30 '19 at 12:22
  • Thank you so much, this seems to be working. I'll try to put this all together and get the results I wanted. – Praveen Iyer Jun 30 '19 at 12:28
  • @PraveenIyer I am glad I was able to help. Feel free to vote up. – balderman Jun 30 '19 at 12:32
  • Yes I did vote up but since my reputation is less than 15 it is not publicly visible :( – Praveen Iyer Jun 30 '19 at 12:34
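
Question marks in printed output usually mean the output stream's encoding cannot represent the characters. One possible approach on Python 3.7+, sketched here as an assumption rather than as the method from the linked question, is to switch the stream to UTF-8 with `reconfigure` before printing. The helper name `print_utf8` is hypothetical, and whether the console then shows the glyphs also depends on the terminal and its font.

```python
import io
import sys


def print_utf8(lines, stream=None):
    """Print each line to the stream, switching the stream to UTF-8 where possible."""
    stream = sys.stdout if stream is None else stream
    # TextIOWrapper.reconfigure exists on Python 3.7+; other stream types are left as-is.
    if hasattr(stream, "reconfigure"):
        stream.reconfigure(encoding="utf-8")
    for line in lines:
        print(line, file=stream)


# Demo against an in-memory stream that starts out as cp1252 (stands in for a console).
buffer = io.BytesIO()
demo_stream = io.TextIOWrapper(buffer, encoding="cp1252", newline="\n")
print_utf8(["\u0939\u093f\u0928\u094d\u0926\u0940"], stream=demo_stream)
demo_stream.flush()
written = buffer.getvalue()
```

Calling `print_utf8(lines)` with no stream reconfigures `sys.stdout` itself. Writing the lines to a file opened with `encoding="utf-8"` is an alternative when the console cannot display Devanagari.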