-4

I am trying to scrape a restaurant's website URL from TripAdvisor. The problem is that the link in the HTML for any restaurant appears to be encoded. For example, for this restaurant:

https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d13544747-Reviews-Amrutha_Lounge-London_England.html

The element that links directly to the restaurant's website shows the following in the HTML:

data-encoded-url="UEJDX2h0dHA6Ly93d3cuYW1ydXRoYS5jby51ay9fdkoz"

How can I get the actual website?

Daniel Wyatt

2 Answers

3

You can do the following:

import base64
code = "UEJDX2h0dHA6Ly93d3cuYW1ydXRoYS5jby51ay9fdkoz"
decoded = base64.b64decode(code)
print(decoded.decode()) # prints PBC_http://www.amrutha.co.uk/_vJ3

You probably want to get rid of the prefix PBC_ and the suffix _vJ3.
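A minimal sketch of that stripping step follows. It assumes the decoded string always has the layout prefix_url_suffix seen in this single example (PBC_ ... _vJ3); TripAdvisor may use other layouts, and the function name decode_tripadvisor_url is made up for illustration.

```python
import base64


def decode_tripadvisor_url(encoded: str) -> str:
    """Decode a data-encoded-url value and strip the wrapper around the URL.

    Assumption: the decoded text looks like "<prefix>_<url>_<suffix>",
    as in the one example from the question.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Drop everything up to and including the first underscore (the prefix).
    _prefix, _, rest = decoded.partition("_")
    # Drop everything from the last underscore onwards (the suffix).
    url, _, _suffix = rest.rpartition("_")
    return url


code = "UEJDX2h0dHA6Ly93d3cuYW1ydXRoYS5jby51ay9fdkoz"
print(decode_tripadvisor_url(code))  # prints http://www.amrutha.co.uk/
```

Splitting on the first and last underscore, rather than slicing fixed lengths, keeps the sketch working if the URL itself contains underscores, but it still relies on the assumed layout.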

Gilfoyle
0

Samuel's answer is better and actually solves the question, but this approach may be useful in other cases. In this particular case, you can also apply a regular expression to the script tag that contains the site link.

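import re
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.tripadvisor.co.uk/Restaurant_Review-g186338-d13544747-Reviews-Amrutha_Lounge-London_England.html'

# Matches the "website":"<url>/" entry embedded in one of the page's script tags
regex = re.compile(r'\"website\":\"http[s]?://www\.[\w]+\.[\w]+[\.]?[\w]+/\"')

response = requests.get(url)
bSoup = bs(response.text, 'html.parser')

# Keep only the script tags whose text matches the regex
soup = bSoup.find_all('script', text=regex)
link = regex.findall(str(soup[0]))
# Strip the leading "website":" (11 characters) and the trailing quote
print(link[0][11:-1])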

Edit: I have added an explanation. Thank you, Samuel, for the suggestion.

This code finds the website link, which is stored in a script tag, using BeautifulSoup and a regular expression. bSoup.find_all('script', text=regex) finds two script tags; the website link is stored in the first one, soup[0]. Because that tag also contains several links to TripAdvisor's own pages, I used the regex shown above to match only the one that is needed: the link to the restaurant's own site. Because the regex matches "website":"http://www.amrutha.co.uk/", I sliced it with link[0][11:-1], which returns just http://www.amrutha.co.uk/.
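As a possible variation, a capture group can return the URL directly, so the [11:-1] slice is not needed. The sketch below runs on a made-up fragment (sample_script) standing in for the script tag's text, since the real page content is not shown here; website_re and extract_website are illustrative names, and the pattern is looser than the one above because it accepts any characters up to the closing quote.

```python
import re

# Hypothetical fragment standing in for the text of the matching script tag.
sample_script = '{"name":"Amrutha Lounge","website":"http://www.amrutha.co.uk/","rating":"5.0"}'

# The parentheses capture only the URL between the quotes after "website":
website_re = re.compile(r'"website":"(https?://[^"]+)"')


def extract_website(text):
    """Return the captured URL, or None if no "website" entry is present."""
    match = website_re.search(text)
    return match.group(1) if match else None


print(extract_website(sample_script))
```

Returning None when nothing matches avoids the IndexError that indexing an empty result list would raise on pages without a website entry.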

  • It may help if you explain a bit what your code does. – Gilfoyle Nov 02 '19 at 20:09
  • It finds a website link which is stored in – Sinner Beekeeping Nov 02 '19 at 22:35
  • Sorry, i made a mistake. First – Sinner Beekeeping Nov 02 '19 at 22:44
  • @You should add this information to your answer and not as a comment. You can edit your answer :) – Gilfoyle Nov 02 '19 at 23:19
  • Ok. Thanx, i will :) – Sinner Beekeeping Nov 03 '19 at 17:14