
For a concrete example of the problem, when I go to the following address in a regular browser:

http://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240

I get redirected to the https version:

https://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240

I tried the following in the Python interactive shell:

>>> from selenium import webdriver
>>> driver = webdriver.PhantomJS()
>>> driver.get("http://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240")
>>> driver.current_url
u'http://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240'

As the output shows, the redirect did not happen. I waited a bit and checked driver.current_url again, but got the same result. How do I make Selenium follow the redirect the way a regular browser does?

EDIT: I tried sending Selenium directly to the https address and it would not go there either! Could that be because the URL points to a file? If this is normal behavior, how can I find out the file's URL when I only have the http link?

AlwaysLearning

1 Answer


The issue is that your page doesn't redirect with a 30X status code. Instead, it uses a different mechanism: a Refresh response header. A Refresh header has the form

Refresh: 5; url=http://www.example.org/fresh-as-a-summer-breeze

Here 5 means "load the given URL after 5 seconds". Below is how I extracted the redirect target using IPython + Requests:

In [1]: import requests

In [2]: res = requests.get("http://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240")

In [3]: res
Out[3]: <Response [200]>

In [4]: res.text
Out[4]: ''

In [5]: res.headers
Out[5]: {'Date': 'Fri, 29 Sep 2017 10:52:14 GMT', 'Server': 'Apache', 'Refresh': '0; url=https://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240', 'Set-Cookie': 'OCSSID=c5eifnobt0942860sraccb2cs0; path=/ocs/', 'Content-Length': '0', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html; charset=UTF-8'}

In [6]: res.headers['Refresh']
Out[6]: '0; url=https://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240'

In [7]: res.headers['Refresh'].split("url=")[-1]
Out[7]: 'https://www.aaai.org/ocs/index.php/SOCS/SOCS16/paper/viewFile/13951/13240'
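
The split above works for this header, but it could be made a little more general. What follows is only a sketch, assuming Python 3: `parse_refresh` and `resolve_final_url` are hypothetical helper names (not part of Requests), and the `max_hops` limit and the 10-second timeout are arbitrary choices. It relies on the fact that `requests.get` already follows 30X redirects by default, so only the Refresh header needs manual handling. A refresh declared in an HTML `<meta>` tag is not covered.

```python
import re
from urllib.parse import urljoin


def parse_refresh(header_value, base_url):
    """Split a Refresh header value into (delay_seconds, absolute_target).

    The target is None when the header carries only a delay, which means
    "reload the same page". Hypothetical helper, not a Requests API.
    """
    delay_part, _, rest = header_value.partition(";")
    delay = float(delay_part.strip() or 0)
    rest = rest.strip()
    if not rest:
        return delay, None
    # Drop the optional, case-insensitive "url=" prefix and any quotes.
    match = re.match(r"url\s*=\s*(.*)", rest, flags=re.IGNORECASE)
    target = (match.group(1) if match else rest).strip().strip("'\"")
    # A relative target is resolved against the URL that sent the header.
    return delay, urljoin(base_url, target)


def resolve_final_url(url, max_hops=5):
    """Follow 30X redirects and Refresh headers; return the last URL reached."""
    import requests  # third-party; imported here so parse_refresh needs only the stdlib

    for _ in range(max_hops):
        # 30X redirects are followed automatically; res.url is where we ended up.
        res = requests.get(url, timeout=10)
        refresh = res.headers.get("Refresh")
        target = parse_refresh(refresh, res.url)[1] if refresh else None
        if target is None or target == res.url:
            return res.url
        url = target
    return url
```

Calling `resolve_final_url` with the http link from the question should then return the https address, provided the server still answers the way it did in the session above.
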
Tarun Lalwani
  • This is even better than selenium, since I do not have to wait for the re-direction to actually happen! However, I would like to make my solution general (i.e. for both *30X* and for *refresh header*). Is the following for 30X: https://stackoverflow.com/a/32528675/2725810 ? Also, are there other possibilities besides *30X* and *refresh header* to consider? Lastly, is my understanding correct that the mechanism used by the page is encoded in the HTTP response code (so that 200 stands for *refresh header* and 302 stands for *30X*)? – AlwaysLearning Sep 29 '17 at 12:55