
I'm using

requests.get('https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists')

like so:

import requests
from bs4 import BeautifulSoup
url = 'https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists'
thepage = requests.get(url)
urlsoup = BeautifulSoup(thepage.text, "html.parser")
print(urlsoup.find_all("a", attrs={"class": "large-3 medium-3 cell image"})[0])

But it keeps scraping from the homepage ('https://www.pastemagazine.com') instead of the full URL. I can tell because I expect the print statement to print:

<a class="large-3 medium-3 cell image" href="/articles/2018/12/the-funniest-tweets-of-the-week-109.html" aria-label="">
    <picture data-sizes="[&quot;(min-width: 40em)&quot;,&quot;(min-width: 64em)&quot;]" class="lazyload" data-sources="[&quot;https://cdn.pastemagazine.com/www/opt/120/dogcrp-72x72.jpg&quot;,&quot;https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg&quot;,&quot;https://cdn.pastemagazine.com/www/opt/120/dogcrp-151x151.jpg&quot;]">
      <img alt="" />
    </picture>
  </a>

But instead it prints:

<a aria-label='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"' class="large-3 medium-3 cell image" href="/articles/2019/01/daily-dose-michael-chapman-feat-bridget-st-john-af.html"> 
    <picture class="lazyload" data-sizes='["(min-width: 40em)","(min-width: 64em)"]' data-sources='["https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-72x72.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg","https://cdn.pastemagazine.com/www/opt/300/MichaelChapman2019_ConstanceMensh_Square-151x151.jpg"]'>
      <img alt='Daily Dose: Michael Chapman feat. Bridget St. John, "After All This Time"'/>
    </picture>
  </a>

This corresponds to an element on the homepage, rather than to the specific search URL I want to scrape. Why does the request redirect to the homepage, and how can I stop it from doing so?

  • Try [looking at redirection history](http://docs.python-requests.org/en/master/user/quickstart/#redirection-and-history) of the request and see how it gets redirected. Open the dev tools in your browser and watch how your URL opens there, and compare the headers passed by the browser and by your script. – 9000 Jan 04 '19 at 03:41
  • Possible duplicate of [Python Requests library redirect new url](https://stackoverflow.com/questions/20475552/python-requests-library-redirect-new-url) – 0xInfection Jan 04 '19 at 04:14
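
As the first comment suggests, you can inspect the redirect history directly; here is a minimal sketch using the URL from the question (r.history holds the intermediate responses, r.url is the address the final response actually came from):

import requests

url = 'https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists'
r = requests.get(url)

# Each entry in r.history is an intermediate response that issued a redirect
for resp in r.history:
    print(resp.status_code, resp.url)

# The URL that was ultimately fetched
print(r.status_code, r.url)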

2 Answers


If you're sure the request is being redirected, you can set allow_redirects to False to prevent the redirection:

r = requests.get(url, allow_redirects=False)
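
With redirects disabled, the response itself shows whether (and where) the server tried to send you; a minimal sketch, assuming the same url as in the question:

r = requests.get(url, allow_redirects=False)
print(r.status_code)              # a 3xx status code confirms a redirect
print(r.headers.get('Location'))  # the target the server wanted to redirect to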

To get the required URLs connected to the tweets, you can try the following script. It turns out that using headers along with cookies solves the redirection issue.

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.pastemagazine.com/search?t=tweets+of+the+week&m=Lists"

with requests.Session() as s:
    # The session keeps cookies across requests; the User-Agent header makes the request look like a browser
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    # Collect the unique absolute URLs of the "tweets of the week" articles
    for link in set(urljoin(url, a.get("href")) for a in soup.select("ul.articles a[href*='tweets-of-the-week']")):
        print(link)

Or to make it even easier, upgrade the following libraries:

pip3 install lxml --upgrade
pip3 install beautifulsoup4 --upgrade

And then try:

with requests.Session() as s:
    res = s.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, 'lxml')
    # With the upgraded parsers, a simpler selector is enough
    for item in soup.select("a.noimage[href*='tweets-of-the-week']"):
        print(urljoin(url, item.get("href")))
  • This worked! Using headers does indeed solve redirection issues. I did not try upgrading libraries to attempt the second code snippet. Thanks so much! – R. Ni Jan 05 '19 at 02:58