I have a URL like this:

href="../job/jobarea.asp?C_jobtype=經營管理主管&peoplenumber=151"

This is how it appears in the browser's Inspect Element view. But when I open it in a new tab, the address is shown as

../job/jobarea.asp?C_jobtype=%B8g%C0%E7%BA%DE%B2z%A5D%BA%DE&peoplenumber=151

How can I find out which encoding the browser uses to convert it? When I request the link with Scrapy, the URL comes out in a different format and the request fails with a 500 Internal Server Error. Could you please explain?

Dev Pandu
  • Does the HTML page have any `<meta>` headers that set the page codec? There could also be a content type set in the HTTP headers (`Content-Type: text/html; charset=....`). – Martijn Pieters Apr 07 '15 at 08:45
  • @MartijnPieters The page has only headers like:`Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8 Accept-Encoding:gzip, deflate, sdch Accept-Language:en-US,en;q=0.8 Cache-Control:no-cache Connection:keep-alive Cookie:case_noteice=mycase; myjobcrm=crmid=myjob; connother%5Fdb=DB1; connjob%5Fdb=DB2; ASPSESSIONIDASARCSTS=MJGFLIOCJADBKKMKMFDEIPNA Host:www.myjob.com.tw Pragma:no-cache User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36` – Dev Pandu Apr 07 '15 at 10:50
  • It'll have a `Content-Type` header too. The browser uses a `charset` parameter in that header if no character set has been defined in the page itself. – Martijn Pieters Apr 07 '15 at 10:52
  • @MartijnPieters The response headers contain `Content-Type:text/html`, and the response page itself contains `charset=big5`. – Dev Pandu Apr 08 '15 at 03:40
  • @MartijnPieters Got the solution. The response page contains `charset=big5`, so I used @Aaron's solution with that codec and got the URL exactly as the browser shows it. Thank you so much, guys. – Dev Pandu Apr 08 '15 at 04:08
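
The lookup described in these comments can be sketched in a few lines. This is only an illustration: `detect_charset` is a hypothetical helper, not part of Scrapy or the standard library, and the sample `<meta>` tag is assumed markup, since the thread only reports that the page contains `charset=big5`. The sketch checks the `Content-Type` header first and falls back to a declaration inside the page.

```python
# -*- coding: utf-8 -*-
import re

# Matches "charset=big5", "charset = 'utf-8'" and similar declarations.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([\w-]+)', re.I)


def detect_charset(content_type, body):
    """Hypothetical helper: return the declared charset in lower case, or None."""
    match = _CHARSET_RE.search(content_type or '')
    if match:
        return match.group(1).lower()
    # Only scan the start of the page, where <meta> declarations normally sit.
    # latin-1 can decode any byte sequence, so this never raises.
    head = body[:2048].decode('latin-1')
    match = _CHARSET_RE.search(head)
    return match.group(1).lower() if match else None


# Header value as reported in the comments; the page markup is an assumed sample.
header = 'text/html'
page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=big5"></head>'
print(detect_charset(header, page))  # big5
```

A regular expression is enough for a quick check like this; a real crawler would more likely rely on the encoding its HTTP or HTML library already detects.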

1 Answer

It's Traditional Chinese, so try cp950 (the Windows code page for Big5):

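# -*- coding: utf-8 -*-
# Python 2

import urllib

# Encode the text as cp950 bytes first, then percent-encode those bytes.
s = '經營管理主管'.decode('utf-8').encode('cp950')
print urllib.quote(s)

# Reverse direction: unquote to cp950 bytes, then decode them.
q = '%B8g%C0%E7%BA%DE%B2z%A5D%BA%DE'
print urllib.unquote(q).decode('cp950').encode('utf-8')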

Result

%B8g%C0%E7%BA%DE%B2z%A5D%BA%DE
經營管理主管
Aaron
  • @philshem, I need it in Python 2.7. – Dev Pandu Apr 07 '15 at 10:52
  • @Aaron, great stuff, I loved it. But when I do it this way I get a 404 and the page is unresponsive. This is the URL I am able to print to the console: **u'../job/jobarea.asp?C_jobtype=\u7d93\u71df\u7ba1\u7406\u4e3b\u7ba1&peoplenumber =151'** – Dev Pandu Apr 07 '15 at 10:54
  • What is the full URL you want to get? – Aaron Apr 07 '15 at 10:59
  • @Aaron, I get a URL like this from my XPath expressions: **u'../job/jobarea.asp?C_jobtype=\u7d93\u71df\u7ba1\u7406\u4e3b\u7ba1&peoplenumber =151'**. The query value I should get when following the link is **'%B8g%C0%E7%BA%DE%B2z%A5D%BA%DE'**; this is the original one shown on the website. [website_link](http://www.myjob.com.tw/job/jobzone.asp) – Dev Pandu Apr 08 '15 at 03:34
  • @Aaron, thank you so much, I got the solution. The response page declares `charset=big5`, so I passed `big5` instead of `cp950` to `.encode()` and now get the correct output. – Dev Pandu Apr 08 '15 at 04:09
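
Putting the thread together, the unicode href that XPath returns can be converted into the form the server expects. The following is a minimal sketch under stated assumptions: `encode_href` is a hypothetical helper, the base URL is the page linked in the comments, and the charset is the `big5` value the asker found in the page. The imports are written so the same code runs on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
try:  # Python 3
    from urllib.parse import quote, urljoin
except ImportError:  # Python 2.7
    from urllib import quote
    from urlparse import urljoin

# Page the link was scraped from (the address linked in the comments).
BASE_URL = 'http://www.myjob.com.tw/job/jobzone.asp'
# Charset declared inside the page, as reported in the comments.
PAGE_CHARSET = 'big5'


def encode_href(href, base_url=BASE_URL, charset=PAGE_CHARSET):
    """Hypothetical helper: percent-encode a unicode href in the page charset."""
    # Turn the unicode text into bytes in the page's own encoding.
    raw = href.encode(charset)
    # Leave the URL delimiters alone; non-ASCII bytes are percent-escaped.
    quoted = quote(raw, safe='/?&=:')
    # Resolve the relative path against the page it came from.
    return urljoin(base_url, quoted)


href = u'../job/jobarea.asp?C_jobtype=\u7d93\u71df\u7ba1\u7406\u4e3b\u7ba1&peoplenumber=151'
print(encode_href(href))
# http://www.myjob.com.tw/job/jobarea.asp?C_jobtype=%B8g%C0%E7%BA%DE%B2z%A5D%BA%DE&peoplenumber=151
```

The resulting string is plain ASCII, so it can be used directly as the request URL. Encoding the same href as UTF-8 would produce a different percent-encoded query, which would explain why the server did not recognise the value.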