I wrote a small piece of code that searches for a random string of three letters in the App Store:

import random, string, urllib  # Python 2

searchTerm = ''.join(random.choice(string.ascii_lowercase) for i in range(3))
urllib.urlretrieve("https://itunes.apple.com/search?country=us&entity=software&limit=5&term=" + searchTerm, "download.txt")

Now I wonder how I could do the same, but with Chinese characters, as I'd like to explore the Chinese App Store as well.

I had a look at the question "making a list of traditional Chinese characters from a string", but it didn't help.

vandernath

1 Answer

What counts as a Chinese character is complicated; see "What's the complete range for Chinese characters in Unicode?"

You could choose Unicode ranges that are suitable for your case:

#!/usr/bin/env python2
import random
import urllib

common, rare = range(0x4e00, 0xa000), range(0x3400, 0x4e00)
chars = map(unichr, rare + common)
random_word = u''.join([random.choice(chars) for _ in range(3)]) # 3 letters
search_term = urllib.quote(random_word.encode('utf-8'), safe='')
path, headers = urllib.urlretrieve("https://itunes.apple.com/search?"
    "country=cn&entity=software&limit=5&term=" + search_term,
    u"download%s.json" % random_word)

You probably want data = json.load(urllib2.urlopen(url)) instead of saving the response to a file.
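For readers on Python 3, the same idea might look roughly like the sketch below. It is not part of the original answer: the helper names `make_random_word`, `build_search_url` and `search` are hypothetical, and it assumes the Python 3 standard-library equivalents (`chr`, `urllib.parse.quote`, `urllib.request.urlopen`) of the Python 2 calls used above. It parses the JSON response directly rather than saving it to a file.

```python
import json
import random
import urllib.parse
import urllib.request

# Same two Unicode ranges as in the answer: CJK Unified Ideographs
# (common) and CJK Extension A (rare). Adjust to suit your case.
COMMON = range(0x4E00, 0xA000)
RARE = range(0x3400, 0x4E00)
CHARS = [chr(cp) for cp in list(RARE) + list(COMMON)]


def make_random_word(length=3):
    """Return `length` random characters drawn from CHARS (hypothetical helper)."""
    return ''.join(random.choice(CHARS) for _ in range(length))


def build_search_url(word, country='cn', limit=5):
    """Percent-encode `word` as UTF-8 and build the search URL (hypothetical helper)."""
    term = urllib.parse.quote(word, safe='')
    return ("https://itunes.apple.com/search?"
            "country=%s&entity=software&limit=%d&term=%s" % (country, limit, term))


def search(word):
    """Fetch the search results and parse them as JSON (performs a network request)."""
    with urllib.request.urlopen(build_search_url(word)) as response:
        return json.load(response)
```

Calling `search(make_random_word())` would perform the actual request; the first two helpers run without network access, so the URL can be inspected before anything is sent.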

jfs
  • Thank you. When I print search_term however I get a weird string of characters like this: %E3%B6%80%E3%B5%8B%E4%9F%BF Why? Problem with encoding? – vandernath Aug 06 '15 at 05:13
  • @vandernath: no, it is expected for an urlencoded value. You want to print `random_word` instead (Unicode string). – jfs Aug 06 '15 at 05:15
  • Got it. Then it's the line search_term that I don't understand, why is it necessary to encode random_word before searching? I'd just like to understand your code as I'm a beginner. :-) – vandernath Aug 06 '15 at 05:17
  • @vandernath: a url can't contain non-ASCII characters. They have to be percent-encoded. Some browsers decode the url in the address bar automatically (to display it to a human) but they always send [percent-encoded urls](https://tools.ietf.org/html/rfc3986#section-2.1) to an http server. Run a network sniffer such as `tcpdump` or `wireshark` to see raw urls (or just copy a url with Unicode and paste it into a text editor). – jfs Aug 06 '15 at 05:30
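
To see that the "weird string" from the comments is only the percent-encoded form of the random word, it can be decoded and re-encoded. This is a small Python 3 sketch, assuming `urllib.parse.quote` and `unquote` with their default UTF-8 handling; the encoded value is the one quoted in the comment above.

```python
from urllib.parse import quote, unquote

encoded = "%E3%B6%80%E3%B5%8B%E4%9F%BF"  # value printed in the comment above

# Decoding the percent-escapes as UTF-8 recovers the three-character word.
decoded = unquote(encoded)
print([hex(ord(c)) for c in decoded])  # ['0x3d80', '0x3d4b', '0x47ff']

# Encoding the word again yields exactly the string that is sent in the URL.
print(quote(decoded, safe='') == encoded)  # True
```

All three code points fall in the 0x3400 to 0x4e00 range that the answer labels `rare`.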