
I am streaming Arabic tweets and storing them in a ".jsonl" file. When I open the file in Xcode, Brackets or TextEdit, the Arabic characters are shown as escape sequences such as "\u0645\u0635\u0631: \u0625\u0646\u0647\u0627\u0621 \u0628\u0639\u0636 \u0627\u0644". To analyse the content, I need the file to display the actual Arabic text. I've managed to print the text correctly in the Python 3 console, but I still need it written to a separate file. This feels like it should be simple, but whenever I use io.open etc. I run into problems. I'd appreciate any ideas!

This is the code that worked for me to print the tweets in the Python console:

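import json  # imported in the original snippet, but not used below

# Read the whole file as text, then split it into one JSON record per line
with open('user_timeline_almanarnews.jsonl', 'r', encoding='utf-8') as f:
    outFile = f.read()
splitFile = outFile.split('\n')

for eachLine in splitFile:
    # Interpret the literal \uXXXX escape sequences as the characters they name
    x = eachLine.encode('utf-8')
    print(x.decode('unicode-escape'))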
Josephina K.
  • The way you see your strings is an abstract representation of the actual bytes, and it's handled by your IDE or any other application that is responsible for displaying them, i.e. it's nothing that Python can do for you. – Mazdak Oct 28 '17 at 11:04
  • Check [this](https://stackoverflow.com/questions/14980421/arabic-characters-in-json-decoding) out – HISI Oct 28 '17 at 11:07
  • @YassineSihi, thanks! I did see these posts. My question is more directed at how to create a new jsonl file that displays the characters correctly... – Josephina K. Oct 28 '17 at 11:12
  • @Kasramvd so there is no way I can save a .jsonl file that displays the Arabic characters correctly? I can convert the "unreadable" jsonl into a csv, which is then readable in TextEdit or Excel, but I would prefer to continue my analysis with the complete .jsonl file... – Josephina K. Oct 29 '17 at 15:06
  • @JosephinaK. I don't know if you can do that or not. As I said it all depends on your IDE, maybe you need to do these adjustments in your config. – Mazdak Oct 30 '17 at 06:44
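
On the open question in the comments (whether a .jsonl file can be saved so that the Arabic text is readable), here is a minimal sketch of one possible approach, not a confirmed fix for this particular file. It assumes every non-empty line of the input is a valid JSON object and that the file is UTF-8 encoded. The function name `rewrite_jsonl_unescaped` and the output filename mentioned afterwards are hypothetical; `json.loads` and `json.dumps(..., ensure_ascii=False)` are standard library calls.

```python
import json


def rewrite_jsonl_unescaped(src_path, dst_path):
    """Copy a JSON Lines file, writing non-ASCII text as real characters.

    Hypothetical helper for illustration; returns the number of records written.
    """
    count = 0
    with open(src_path, 'r', encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines, e.g. after a trailing newline
            # Parsing the JSON turns \uXXXX escapes into real characters
            record = json.loads(line)
            # ensure_ascii=False keeps those characters instead of re-escaping them
            dst.write(json.dumps(record, ensure_ascii=False) + '\n')
            count += 1
    return count
```

Calling it as `rewrite_jsonl_unescaped('user_timeline_almanarnews.jsonl', 'user_timeline_almanarnews_readable.jsonl')` should produce a second file that holds the same records, with the Arabic text stored as UTF-8 characters rather than `\u` escapes. Both files describe the same JSON data, so any JSON parser reads them identically; only the on-disk representation differs.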

0 Answers