Trouble scraping a website link using requests

Question

I'm trying to fetch the website link connected to this L'atelier de willy restaurant from a webpage, but I can't get it to work.

Website address

This is how it appears on that page (in the same block where the name of the restaurant is shown in very bold letters):

I've tried with:

import requests
from bs4 import BeautifulSoup

link = "https://www.tripadvisor.fr/Restaurant_Review-g188644-d14788983-Reviews-Mozart_More_Than_Just_Ribs-Brussels.html"

res = requests.get(link,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(res.text,"lxml")
website = soup.select_one("[class*='website']").get("data-ahref")
print(website)

Output I'm getting:

q5aizCJEIWEVtIiYHVLaizCJEIWHEpttVcL4pIaQtipEnV1zS0pIaQaVMSpa1EVTVEEJc

What I wish to get:

https://mozart-resto.be/

How can I parse that website link using requests?

Andrej Kesely · Accepted Answer · 2019-07-31T17:12:15.250

The site is using an "asdf" encoder (I'm not sure if that's the official name). Using the answer from "Converting JavaScript code to Python", you should be able to decode this string:

d = "q5aizCJEIWEVtIiYHVLaizCJEIWHEpttVcL4pIaQtipEnV1zS0pIaQaVMSpa1EVTVEEJc"

def asdf(d):

  b = ""
  h = {
    "": ["&", "=", "p", "6", "?", "H", "%", "B", ".com", "k", "9", ".html", "n", "M", "r", "www.", "h", "b", "t", "a", "0", "/", "d", "O", "j", "http://", "_", "L", "i", "f", "1", "e", "-", "2", ".", "N", "m", "A", "l", "4", "R", "C", "y", "S", "o", "+", "7", "I", "3", "c", "5", "u", 0, "T", "v", "s", "w", "8", "P", 0, "g", 0],
    "q": [0, "__3F__", 0, "Photos", 0, "https://", ".edu", "*", "Y", ">", 0, 0, 0, 0, 0, 0, "`", "__2D__", "X", "<", "slot", 0, "ShowUrl", "Owners", 0, "[", "q", 0, "MemberProfile", 0, "ShowUserReviews", '"', "Hotel", 0, 0, "Expedia", "Vacation", "Discount", 0, "UserReview", "Thumbnail", 0, "__2F__", "Inspiration", "V", "Map", ":", "@", 0, "F", "help", 0, 0, "Rental", 0, "Picture", 0, 0, 0, "hotels", 0, "ftp://"],
    "x": [0, 0, "J", 0, 0, "Z", 0, 0, 0, ";", 0, "Text", 0, "(", "x", "GenericAds", "U", 0, "careers", 0, 0, 0, "D", 0, "members", "Search", 0, 0, 0, "Post", 0, 0, 0, "Q", 0, "$", 0, "K", 0, "W", 0, "Reviews", 0, ",", "__2E__", 0, 0, 0, 0, 0, 0, 0, "{", "}", 0, "Cheap", ")", 0, 0, 0, "#", ".org"],
    "z": [0, "Hotels", 0, 0, "Icon", 0, 0, 0, 0, ".net", 0, 0, "z", 0, 0, "pages", 0, "geo", 0, 0, 0, "cnt", "~", 0, 0, "]", "|", 0, "tripadvisor", "Images", "BookingBuddy", 0, "Commerce", 0, 0, "partnerKey", 0, "area", 0, "Deals", "from", "//", 0, "urlKey", 0, "'", 0, "WeatherUnderground", 0, "MemberSign", "Maps", 0, "matchID", "Packages", "E", "Amenities", "Travel", ".htm", 0, "!", "^", "G"]
  }

  a = 0
  while a < len(d):              # a while loop, because a prefix skips an extra character
    j = d[a]
    f = j

    # "q", "x" and "z" are prefixes that select an alternate table and
    # consume the following character as well (if there is one).
    if j in h and a + 1 < len(d):
      a = a + 1
      f = f + d[a]
    else:
      j = ""

    g = getOffset(ord(d[a]))
    if g < 0:
      b = b + f                  # character outside the alphabet: copy it through
    else:
      b = b + str(h[j][g])

    a = a + 1
  return b

def getOffset(a):
    if(a >= 97 and a <= 122):
        return(a-61)
    if(a >= 65 and a <= 90):
        return(a-55)
    if(a >= 48 and a <=71):
        return(a-48)
    print ("\n\nERROR\n\n")
    return(-1)

print(asdf(d))

Prints:

https://mozart-resto.be/mozart-brussel/?utm_source=tripadvisor&utm_medium=referral
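
To see why the string decodes this way, it can help to look at the character-to-index mapping on its own. The sketch below repeats the same ranges as getOffset under a hypothetical name, char_offset, and leaves out the ERROR print. It only illustrates the mapping used in this answer and is not taken from the site's own code.

```python
def char_offset(ch):
    """Map one character to a table index, using the same ranges as getOffset.

    char_offset is a hypothetical name used only for this illustration.
    """
    code = ord(ch)
    if 97 <= code <= 122:   # 'a'..'z' -> 36..61
        return code - 61
    if 65 <= code <= 90:    # 'A'..'Z' -> 10..35
        return code - 55
    if 48 <= code <= 71:    # '0'..'9' -> 0..9; the answer's bound of 71 also maps ':'..'@' to 10..16
        return code - 48
    return -1               # anything else is copied through unchanged


# First characters of the encoded string from the question: "q5aizC..."
# "q" is a prefix that selects the "q" table, so "q5" is looked up as h["q"][5].
for ch in "5aiC":
    print(repr(ch), "->", char_offset(ch))
```

With the tables from the answer, h["q"][5] is "https://", h[""][36] is "m", h[""][44] is "o", and h["z"][12] is "z", which is how the decoded URL starts with https://moz.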

EDIT (For selecting the link):

from bs4 import BeautifulSoup
import requests

url = 'https://www.tripadvisor.fr/Restaurant_Review-g188644-d14788983-Reviews-Mozart_More_Than_Just_Ribs-Brussels.html'

soup = BeautifulSoup(requests.get(url).text, 'lxml')

s = soup.select_one('div[data-ahref]:has(span:contains("Site Web"))')
if s:
    site_web = asdf(s['data-ahref']) # this is the decoder function
    print(site_web)

Prints:

https://mozart-resto.be/mozart-brussel/?utm_source=tripadvisor&utm_medium=referral

Awesome @Andrej!! Thanks for your time. – MITHU, Jul 31 '19 at 18:08

12944qwerty · Answer 2 · 2019-07-31T17:10:10.707

~~Unfortunately, I don't have commenting yet, and this is supposed to be a comment :(~~

So, your code reads the data-ahref attribute, and that attribute holds exactly what your output shows. The source of the website shows

<div class="is-hidden-mobile blEntry website  ui_link" data-ahref="q5aizCJEIWEVtIiYHVLaizCJEIWHEpttVcL4pIaQtipEnV1zS0pIaQaVMSpa1EVTVEEJc" data-column="2" data-trackingkey="URL_EATERY" data-eventname="bl_contact_website" data-blcontact="URL_HOTEL" onclick="widgetEvCall('handlers.onWebLinkClicked', event, this)"><span class="primary_icon ui_icon laptop"></span><span class="detail ">Site Web</span></div>

and it says that data-ahref="q5aizCJEIWEVtIiYHVLaizCJEIWHEpttVcL4pIaQtipEnV1zS0pIaQaVMSpa1EVTVEEJc". This means that your code is working properly (and so is bs4).
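
The same check can be reproduced offline. The sketch below feeds a shortened copy of that div to Python's standard-library HTMLParser and reads the attribute back. WebsiteAttrParser is a made-up helper for this illustration, and BeautifulSoup as used in the question would do the same job.

```python
from html.parser import HTMLParser


class WebsiteAttrParser(HTMLParser):
    """Collect data-ahref from the first tag whose class list contains 'website'.

    WebsiteAttrParser is a hypothetical helper written for this example.
    """

    def __init__(self):
        super().__init__()
        self.encoded = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.encoded is None and "website" in classes:
            self.encoded = attr_map.get("data-ahref")


# Shortened copy of the div quoted above (tracking attributes left out).
snippet = (
    '<div class="is-hidden-mobile blEntry website  ui_link" '
    'data-ahref="q5aizCJEIWEVtIiYHVLaizCJEIWHEpttVcL4pIaQtipEnV1zS0pIaQaVMSpa1EVTVEEJc" '
    'data-column="2"><span class="detail ">Site Web</span></div>'
)

parser = WebsiteAttrParser()
parser.feed(snippet)
print(parser.encoded)
```

The printed value is the still-encoded string, which confirms that the extraction step is not where the problem lies.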

Another thing is that when I click on the Site Web link, I am brought to this link and not the link you wish for. And the link you want is not found anywhere in the source code.

So, are you sure you are looking for the right things?

EDIT: Looking at Andrej Kesely's answer, I realize that the first part of my answer doesn't apply. I didn't realize that data-ahref is actually an encoded string that contains the URL.

csurfer · Answer 3 · Jul 31 '19 at 17:13 · score 0

Why don't you check out requests-html (https://html.python-requests.org/)? It was written with the intention of being used for parsing web pages.
