
I want to scrape this website https://lens.zhihu.com/api/v4/videos/1123764263738900480 to get the play_url using Python.

This website redirects very quickly while the URL stays unchanged. The play_url in the original page is invalid: if you visit it, you will see "You do not have permission...". So I use time.sleep(10) in the program to wait for the redirect (this does not seem to work with Requests).

(Sorry, I made a mistake. The redirect I saw may just have been caused by my Firefox browser. But the method I mentioned really can handle redirects.)

But as I can see in 1.txt, which the program writes, the scraped content doesn't contain the play_url I want, and the URL in it is still invalid.

Here is the play_url I want. It can be seen in the browser Inspector, in an a tag whose class is url: [image]

Here is the code I use:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

service = Service(executable_path='C:\\Users\\X\\chromedriver_win32\\chromedriver.exe')
options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
options.add_argument("--headless")
options.add_argument('user-agent="Mozilla/5.0"')
driver = webdriver.Chrome(service=service, options=options)

driver.get('https://lens.zhihu.com/api/v4/videos/1123764263738900480')
time.sleep(10)
pageSource = driver.page_source
driver.quit()
bs = BeautifulSoup(pageSource, 'html.parser')

with open('C:\\Users\\X\\Desktop\\1.txt', 'a', encoding='utf-8') as file:
    file.write(f"{bs}")
play_url = bs.find('a', {'class': 'url'}).get("title")
print(play_url)

and it returns:

Traceback (most recent call last):
  File "c:\Users\X\Desktop\Handle redirect\stackoverflow.py", line 22, in <module>      
    play_url = bs.find('a', {'class': 'url'}).get("title")
AttributeError: 'NoneType' object has no attribute 'get'

So the scraped content has no ('a', {'class': 'url'}) element, although I see one in the browser Inspector.

Why is the scraped content different from what I see in the browser Inspector, and how can I handle it?

Edit: Thanks to the comment by Martin Evans, I now know that the browser processes the JavaScript in the source code, so the rendered page looks different from the source code. But in my case, I don't see any JS links in the Network tab of Developer Tools. Actually, there are only two links: [image]. So I still have no answer to the question above.

Update: Thanks to the comment by @Sarhan, I solved the problem. I was using Firefox, and that browser renders the source code automatically, adding the tag even though it does not exist in the response. I tried the URL in Edge and there is no ('a', {'class': 'url'}) at all. Also, many thanks to @Dimitar, whose answer let me get the play_url.

Harris
  • The HTML is often rewritten in the browser using Javascript. `requests` gives the raw HTML with no Javascript processing – Martin Evans Apr 29 '22 at 12:00
  • But I'm using `Selenium` in my case. It seems to handle the Javascript in html. – Harris Apr 29 '22 at 12:16
  • Selenium is a remote control for a browser backend – Martin Evans Apr 29 '22 at 12:18
  • Sorry, I don't understand what you said. But I scraped another website using Selenium, it renders the html so I can use `find` to locate elements which I see in browser Inspector. – Harris Apr 29 '22 at 12:25
  • Selenium is loading a web browser for you, you choose which. e.g. Chrome, Firefox or other. The browser is doing its normal Javascript processing. This is also why it is much slower than using requests. In most cases it is possible to extract the same data using just requests but it is more complicated as additional calls are needed to extract the data from various API calls that the browser makes. – Martin Evans Apr 29 '22 at 12:30
  • I see. Selenium loads a browser, the browser handles the Javascript. But in this case it doesn't work. I don't see any js file in Network of Inspector. – Harris Apr 29 '22 at 12:38
  • "So in the scraped content, there is no ('a', {'class': 'url'}) which I see in browser Inspector." Strange. When I try this, I **don't see a web page at all**; I see a JSON document. Any HTML that your browser's Inspector view shows you, is the HTML of a fake page used to render and present the JSON. Before looking at the inspector, look at the page source, and at the page. – Karl Knechtel Mar 28 '23 at 11:04
  • As an aside, any url that has something like `api/v4/` in it (or `v1`, or any other number) is almost certainly intended to return a JSON response. – Karl Knechtel Mar 28 '23 at 11:06
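
The point made in the comments above can be checked in code before reaching for an HTML parser. This is only a minimal sketch, assuming the response body has already been fetched as text; the helper name `parse_if_json` and both sample strings are made up for illustration and are not real responses from the site.

```python
import json


def parse_if_json(text):
    """Return the decoded object if text is a JSON document, else None."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Not JSON (for example, an HTML page), so let the caller decide what to do.
        return None


# Hypothetical sample bodies, used only to show the two outcomes.
json_body = '{"playlist": {"LD": {"play_url": "https://example.com/v.mp4"}}}'
html_body = '<html><body><a class="url" title="x">x</a></body></html>'

print(parse_if_json(json_body) is not None)  # True
print(parse_if_json(html_body) is not None)  # False
```

If the helper returns a dict, searching it by key is likely simpler than looking for tags with BeautifulSoup.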

1 Answer


You can use Requests to get the response from that url and extract play_url.

import requests

url = "https://lens.zhihu.com/api/v4/videos/1123764263738900480"

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36"
}

response = requests.get(url, headers=headers)

data = response.json()
play_url = data['playlist']['LD']['play_url']

print(play_url)
Dimitar
  • @ndjcf The returned data is JSON. I think your browser inspector was showing you the JSON as beautified by your browser (colorized, with links replaced by anchors). Next time, use view source instead of the inspector when you are not sure the output is correct, or log/output pageSource before parsing and searching inside it with your parser. – Sarhan Apr 29 '22 at 13:44
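
Since the data is JSON, the lookup in the answer can be made a little more defensive. The sketch below is not the answer author's code: it assumes only the `playlist` / `play_url` nesting shown in the answer, and the helper name `collect_play_urls`, the `HD` key, and the example.com URL are hypothetical placeholders.

```python
def collect_play_urls(data):
    """Map each quality label under data['playlist'] to its play_url.

    Entries that are not dicts or that lack a play_url are skipped,
    so a missing key does not raise KeyError.
    """
    urls = {}
    playlist = data.get("playlist") or {}
    for quality, entry in playlist.items():
        if isinstance(entry, dict) and entry.get("play_url"):
            urls[quality] = entry["play_url"]
    return urls


# Hypothetical sample shaped like the structure used in the answer.
sample = {
    "playlist": {
        "LD": {"play_url": "https://example.com/ld.mp4"},
        "HD": {},  # made-up entry with no play_url, to show it is skipped
    }
}

print(collect_play_urls(sample))  # {'LD': 'https://example.com/ld.mp4'}
```

With a real response, the argument would be the result of `response.json()` from the answer's code.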