0

I need to set up headers for urllib.request to get the real page and avoid redirection. If I use just this code:

import urllib.request
urllib.request.urlretrieve("https://open.spotify.com/artist/4npEfmQ6YuiwW1GpUmaq3F", "test.html")

Spotify recognizes that I use an unsupported browser and will redirect me to a different page. I need to get the original HTML and I think the setup headers can help.

ad absurdum
  • 19,498
  • 5
  • 37
  • 60
  • Does this answer your question? [How do I set headers using python's urllib?](https://stackoverflow.com/questions/7933417/how-do-i-set-headers-using-pythons-urllib) – costaparas Dec 21 '20 at 13:44
  • If you're trying to emulate the browser, you may be better off using Selenium instead of urllib https://selenium-python.readthedocs.io/ – GTBebbo Dec 21 '20 at 13:46
  • This is essentially a duplicate of [this](https://stackoverflow.com/questions/7933417/how-do-i-set-headers-using-pythons-urllib) post. Also follows on from your [previous post](https://stackoverflow.com/questions/65392491/python-curl-output-different-from-original-html) about `curl`. – costaparas Dec 21 '20 at 13:47
  • I know the selenium is better fo it but I cant use it because I need this process on the background and also multiplatform. – Tomáš König Dec 21 '20 at 13:56
  • @costaparas Thanks for the advice but I am still not sure which contract should by inside header. – Tomáš König Dec 21 '20 at 13:57
  • Not sure what you mean. Have you tried adding both headers mentioned in the [original post](https://stackoverflow.com/questions/65392491/python-curl-output-different-from-original-html)? Did it not work? – costaparas Dec 21 '20 at 13:59
  • I ran your code (as is) and it works. Are you running it locally? When you open test.htm in notepad does it contain the HTML you expect? – Greg Dec 21 '20 at 14:07
  • I try this: from urllib.request import Request, urlopen req = Request('https://open.spotify.com/artist/711MCceyCBcFnzjGY4Q7Un') req.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8') req.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.2 Safari/605.1.15') content = urlopen(req).read() print(content) – Tomáš König Dec 21 '20 at 14:08
  • But it still returns the redirected page – Tomáš König Dec 21 '20 at 14:10
  • @Greg Yes is possible to run the code but if you open the URL in the web browser you will get the right page but if you run the code it will store another (some redirected page) – Tomáš König Dec 21 '20 at 14:14

1 Answers1

1

It looks like if a request adds a user-agent, Spotify will preform additional checks. This may be fixed by adding all the headers. Alternatively you can set the user agent to a Brower that Spotify doesn't know about e.g TEST.

I've got the following scrape code to work without any headers, so I wouldn't admit headers unless there was a specific reason. (I've only added it, because of the question in the title).

import requests
from bs4 import BeautifulSoup

urls = [
 'https://open.spotify.com/artist/711MCceyCBcFnzjGY4Q7Un',
 'https://open.spotify.com/artist/4npEfmQ6YuiwW1GpUmaq3F'
]
headers = {
    'user-agent': 'TEST'
}

for url in urls:
    response = requests.get(url, headers=headers)
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    #print(soup.prettify())
    print(soup.find('h1').text.strip())

Output:

AC/DC
Ava Max
Greg
  • 4,468
  • 3
  • 16
  • 26