I have written this function to scrape the top 10 results from a Google search:

def google_search(self,query):
    """
        Return the URLs of the top 10 Google search results for a keyword.
    """
    params = {'q':query}
    url = 'https://www.google.com/search?'+urllib.urlencode(params)
    result = urlfetch.fetch(url=url)
    content = result.content
    soup = BeautifulSoup(content)
    # Each search result is expected in an <li class="g"> element.
    items = soup.findAll("li", {'class':'g'})
    urls = []
    for item in items:
        # Take the first link of each result and make it absolute.
        link = item.findAll('a')[0]
        url = 'https://www.google.com'+link['href']
        urls.append(url.encode('utf-8'))
    return urls

Then I wrote this other function that finds related Wikipedia articles based on a Google search:

def wikipedia_search(self,query,language='en'):
    """
        Return a list of URLs and titles of the top Wikipedia search results for a keyword.
    """
    q = query+u' site:%s.wikipedia.org' %language
    urls = self.google_search(q.encode('utf-8'))
    results = []
    for url in urls:
        # google_search already returns UTF-8 byte strings, so no re-encoding is needed.
        title = re.findall(r'/wiki/(.*)&s',url)[0].replace("_"," ")
        link = re.findall(r'q=(.*)&s',url)[0]
        url_tag = {'url':link ,'title' :title}
        results.append(url_tag)
    return results

But when I search in Arabic, I get results like this: {'title': '%25D8%25AD%25D9%2583%25D9%2588%25D9%2585%25D8%25A9', 'url': 'https://ar.wikipedia.org/wiki/%25D8%25AD%25D9%2583%25D9%2588%25D9%2585%25D8%25A9'}, {'title': '%25D8%25A8%25D9%258A%25D8%25AA %25D9%2588%25D9%258A%25D9%2586%25D8%25AF%25D8%25B3%25D9%2588%25D8%25B1', 'url': 'https://ar.wikipedia.org/wiki/%25D8%25A8%25D9%258A%25D8%25AA_%25D9%2588%25D9%258A%25D9%2586%25D8%25AF%25D8%25B3%25D9%2588%25D8%25B1'}. The titles are unreadable and the URLs cannot be opened as they are.

Nazih AIT BENSAID

1 Answer

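The data is UTF-8 encoded bytes escaped with URL quoting, so you need to unquote it and then decode it. In your output every % has itself been escaped as %25, which means the value was quoted twice and needs two unquote passes:

url=urllib.unquote(urllib.unquote(url)).decode('utf8')

Demo of a single pass: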

>>> import urllib 
>>> url='example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> urllib.unquote(url).decode('utf8') 
u'example.com?title=\u043f\u0440\u0430\u0432\u043e\u0432\u0430\u044f+\u0437\u0430\u0449\u0438\u0442\u0430'
>>> print urllib.unquote(url).decode('utf8')
example.com?title=правовая+защита

(Post directly quoted from Url decode UTF-8 in Python since I can't comment yet)
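
As an alternative to the regular expressions in the question, the redirect link can be parsed with the standard library, which undoes the outer layer of quoting automatically. The following is only a minimal sketch, written for Python 3, where these helpers live in urllib.parse (on Python 2 the equivalents are urlparse.parse_qs and urllib.unquote). It assumes Google's links have the form /url?q=<target>&sa=..., which is what the q=(.*)&s pattern in the question suggests. The function names target_from_redirect and wiki_title, and the sample href, are made up for illustration.

```python
from urllib.parse import urlparse, parse_qs, unquote


def target_from_redirect(href):
    """Return the decoded value of the q parameter of a redirect link."""
    query = urlparse(href).query
    # parse_qs percent-decodes every value once, which removes the
    # outer layer of quoting (%25 becomes %).
    return parse_qs(query)["q"][0]


def wiki_title(article_url):
    """Return a readable title from a Wikipedia article URL."""
    path = urlparse(article_url).path
    slug = path.split("/wiki/", 1)[1]
    # Second pass: turn the remaining %XX escapes into UTF-8 text.
    return unquote(slug).replace("_", " ")


# Hypothetical href, shaped like the first result in the question.
href = ("https://www.google.com/url?q=https://ar.wikipedia.org/wiki/"
        "%25D8%25AD%25D9%2583%25D9%2588%25D9%2585%25D8%25A9&sa=U")

article_url = target_from_redirect(href)
title = wiki_title(article_url)
print(article_url)
print(title)
```

Here article_url keeps its single layer of percent-escapes, so it is still a valid link to request, while title is fully decoded for display.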

Community
pie3636