2

I am trying to read a text file that contains public Instagram image posts and their metadata. Each line holds one complete post along with all its metadata. Part of each post is written in Arabic. When I read the file with Python, the Arabic text does not show up when I print the line; it appears as escape sequences such as \xd9\x8a\xd8.

This is the code snippet I am using to read from the .txt file:

import codecs

test_file = codecs.open('instagram_info.txt', mode='r', encoding='utf-8')
print("reading  images URLs file")
counter = 0
for line in test_file:
    print("Line: ", line.encode("utf-8"))
    counter += 1
    print(counter)
    if counter == 50:
        break
test_file.close()

This is an example line from the text file:

100158441   25.256887893    51.507485363    Centerpoint 4f09c7a6e4b090ef234993e3               http://scontent.cdninstagram.com/hphotos-xpa1/outbound-distilleryimage9/t0.0-17/OBPTH/9ecde7ecac7811e3b87a12bcaa646ac5_8.jpg sarrah80    25.256887893    51.507485363    2014-03-15 19:37:45 1394912265  16144       ولا راضي يوقف يم الارنوب عشان اصوره dody_nasser said "هههه اكيد خايف الجبان "  nassersahim said "@sarrah80 يبغي يملغ عليكم"  sarrah80 said "@dody_nasser بطل ولدي بس خبرج المود ومايسوي"  sarrah80 said "@nassersahim انت شفت الأرنب شلون يطالعه ذبحني من الضحك "  arwa9009 said "حياتي"  fatimaaljasssim said "حياتتتتتتتنتتي عليهم فديتهم"  6   non_al3yooon,mun.mun_almalki,__manoor__,monaalalii  46

Also, the current code adds "b'" as a prefix to every line it reads. Any idea why this is happening?

Ali Khalil

3 Answers

1
  1. Python 3 supports Unicode natively. You do not need codecs.open; the built-in open will work.
  2. .encode is what turns the text into \xd9\x8a\xd8: it converts the string to a bytes object, and printing bytes shows the b' prefix with non-ASCII bytes escaped. Remove that call: print("Line: ", line)
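A minimal sketch of where the b' prefix and the \x.. escapes come from, assuming Python 3. The sample word is taken from the example line in the question and is written with escapes so the snippet does not depend on the console; the variable names are illustrative.

```python
# One Arabic word from the sample line in the question ("حياتي"),
# spelled with \u escapes so the source itself stays ASCII.
text = "\u062d\u064a\u0627\u062a\u064a"

# str.encode() returns a bytes object, not text.
encoded = text.encode("utf-8")

print(type(text).__name__)     # str
print(type(encoded).__name__)  # bytes

# Printing bytes shows a b'...' literal with non-ASCII bytes as \x.. escapes.
print(encoded)  # b'\xd8\xad\xd9\x8a\xd8\xa7\xd8\xaa\xd9\x8a'

# Decoding with the same codec gives the original text back.
decoded = encoded.decode("utf-8")
print(decoded == text)  # True
```

The file was already decoded to str when it was opened with encoding='utf-8', so encoding each line again only undoes that step.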
NightShadeQueen
  • I tried your advice @NightShadeQueen, but it gave another error, see below: return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined> – Ali Khalil Jul 02 '15 at 00:29
  • Interesting. Are you sure your input is UTF-8 and not UTF-16? See: http://stackoverflow.com/questions/17912307/u-ufeff-in-python-string – NightShadeQueen Jul 02 '15 at 00:33
  • Yes, the text file encoding is UTF-8, @NightShadeQueen. – Ali Khalil Jul 02 '15 at 07:37
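The '\ufeff' in that traceback is a byte order mark (BOM), which a file can carry even when it really is UTF-8. The sketch below uses an in-memory byte string rather than the real file (that instagram_info.txt starts with a BOM is an assumption suggested by the traceback) to show how the utf-8 and utf-8-sig codecs treat it.

```python
import codecs

# Simulated start of a UTF-8 file saved with a BOM (hypothetical content).
raw = codecs.BOM_UTF8 + "100158441\tCenterpoint\n".encode("utf-8")

as_utf8 = raw.decode("utf-8")          # BOM survives as the character U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is stripped when present

print(ascii(as_utf8[0]))      # '\ufeff'
print(ascii(as_utf8_sig[0]))  # '1'
```

Reading with utf-8-sig is harmless for files without a BOM, since the codec only removes the mark if it is there.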
0

The problem is not with reading the text; it is with print(). Your console may not be capable of displaying the Unicode text. Try writing the result to a file and inspecting it in a Unicode-capable text editor.

First, follow NightShadeQueen's suggestions. Then try copying the lines to another file to check:

#!python3
with open('instagram_info.txt', mode='r', encoding='utf-8') as fin, \
     open('output.txt', 'w', encoding='utf-8') as fout:
    for n, line in enumerate(fin, 1):
        fout.write(line)
        if n == 50:
            break

Use the with construct, which closes the file objects automatically. enumerate() counts the lines for you. With this code, and with your example stored in instagram_info.txt as UTF-8, output.txt should be identical to the first 50 lines of the input.

Then try the second example, which uses print() for the same case. Notice the end='' in the print call: it suppresses the automatic newline, because each line already ends with one.

#!python3
with open('instagram_info.txt', encoding='utf-8') as f:
    for n, line in enumerate(f, 1):
        print(line, end='')
        if n == 50:
            break

If you are using Windows, open a cmd window and try switching the code page using

c:\...\>chcp 65001

and run the Python script again. The console still may not be able to display all characters (the console is rather dumb). It may be easier to display the text in a Python GUI window.
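To see why print() can fail even though the file was read correctly, here is a sketch that reproduces the encoding step print() performs on its output stream. It assumes a legacy Windows code page, cp1252, as a stand-in for the console encoding; the real value on a given machine is whatever sys.stdout.encoding reports.

```python
import sys

text = "\u062d\u064a\u0627\u062a\u064a"  # an Arabic word, written with escapes

# print() encodes str with the stream's encoding before writing it out.
print("console encoding:", sys.stdout.encoding)

try:
    text.encode("cp1252")  # cp1252 has no Arabic letters
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("cannot encode:", ascii(exc.object[exc.start:exc.end]))

# Lossy fallback: emit escapes for unencodable characters instead of raising.
safe = text.encode("cp1252", errors="backslashreplace").decode("cp1252")
print(safe)
```

The fallback only keeps the script from crashing; the Arabic text is still not displayed, which is why writing to a UTF-8 file is the more reliable check.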

pepr
  • 1
    don't use `chcp 65001`. To print arbitrary text to Windows console, [use `win-unicode-console` package instead](http://stackoverflow.com/a/30551552/4279) – jfs Jul 02 '15 at 13:24
0

Don't encode the line; print Unicode text directly:

#!/usr/bin/env python3
from itertools import islice

with open('instagram_info.txt', encoding='utf-8-sig') as file:
    print("reading  images URLs file")
    for line in islice(file, 50): # read no more than 50 lines from the file
        print("Line: ", line, end='')
jfs