I wrote a program that scrapes the web to get a JSON subtitle file. The JSON is in Persian. I used decode("utf-8"), but the characters still show up as escape codes. What should I do?

I'm using Python 3.4 on Windows 8. This is my code:

>>> import urllib.request as urllib2
>>> print(urllib2.urlopen('http://www.ted.com/talks/subtitles/id/667/lang/fa').read().decode("utf-8"))

{"captions":[{"duration":4000,"content":"\u0627\u0645\u0631\u0648\u0632\u0647 \u062a\u0645\u0627\u0645 \u0628\u0646\u0627\u0647\u0627 \u06cc\u06a9 \u0686\u06cc\u0632 \u0645\u0634\u062a\u0631\u06a9 \u062f\u0627\u0631\u0646\u062f.","startOfParagraph"...

The first row is this: [image]

I tried writing the string to a file this way, but the problem persists:

with open('D:\\result.json', 'w') as fid:
    fid.write(urllib2.urlopen('http://www.ted.com/talks/subtitles/id/667/lang/fa').read().decode("utf-8"))
parvij

1 Answer

You have JSON there, with the Arabic-script characters escaped as \uXXXX sequences, as permitted by RFC 7159. Decoding the bytes as UTF-8 does not touch those escapes; you need to parse the text with the json module to undo them. Once you've done that, you should be able to extract the "content" value and print it (to a file, since Windows can't always display Unicode properly at the console). Something like this:

>>> import json
>>> import urllib.request as urllib2
>>> result = json.loads(urllib2.urlopen('...').read().decode('utf8'))
>>> with open('example.txt', 'w', encoding='utf8') as f:
...     print(result['captions'][0]['content'], file=f)

You should then be able to open example.txt with your editor of choice. If it displays incorrectly, be sure to set the encoding to UTF-8.
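
If you want to keep the whole document as JSON but with readable Persian text, something along these lines might work. This is only a sketch: the sample payload is a shortened stand-in for the real response, and the names decode_captions and to_readable_json are made up for illustration. It relies on json.loads to undo the \uXXXX escapes and on ensure_ascii=False to stop json.dumps from re-escaping them.

```python
import json

# Hypothetical, shortened stand-in for the response in the question.
# The raw string keeps each \uXXXX escape as literal text, the way it
# arrives from the server.
SAMPLE = r'{"captions":[{"duration":4000,"content":"\u0627\u0645\u0631\u0648\u0632\u0647"}]}'


def decode_captions(raw):
    """Parse the JSON text and return the caption strings, unescaped."""
    data = json.loads(raw)
    return [caption["content"] for caption in data["captions"]]


def to_readable_json(raw):
    """Re-serialize the JSON without escaping non-ASCII characters."""
    data = json.loads(raw)
    # ensure_ascii=False keeps the Persian characters as-is instead of
    # turning them back into \uXXXX sequences.
    return json.dumps(data, ensure_ascii=False)


captions = decode_captions(SAMPLE)
readable = to_readable_json(SAMPLE)

# Write to a file with an explicit encoding rather than printing, since
# the Windows console may not be able to display these characters.
with open("captions.json", "w", encoding="utf-8") as f:
    f.write(readable)
```

The explicit encoding="utf-8" matters on Windows, where open() otherwise falls back to the locale's code page and may fail to encode Persian text.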

Kevin