I'm trying to scrape a website using Selenium, but I'm having a problem. There are 150 pages I need to check, and they're of the form "base_url&page=X". But when I call driver.get("base_url&page=X"), it strips off the &page=X for some reason.

When I print the link, it shows up correctly as "base_url&page=X". Clicking that printed link opens just base_url, but if I copy and paste it instead, it brings me to the correct page, "base_url&page=X".

Any idea what the problem is or how to go about fixing it?

import time

from bs4 import BeautifulSoup
from selenium import webdriver

def get_page(url):
    # A fresh driver is created for every page
    DRIVER = webdriver.Chrome(chrome_options=chrome_options)
    DRIVER.get(url)
    time.sleep(2)
    data = DRIVER.page_source
    DRIVER.close()
    return BeautifulSoup(data, "html.parser")

for i in range(1, 5):
    page_url = BASE_URL + "&page=" + str(i)
    parsed_site = get_page(page_url)
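(As an aside, it can help to rule out the string-building step by checking that the concatenated URL actually carries the `page` parameter before handing it to the driver. A small sketch using only the standard library, with a made-up `BASE_URL` since the real one isn't shown in the question:)

```python
try:
    from urllib.parse import urlsplit, parse_qs  # Python 3
except ImportError:
    from urlparse import urlsplit, parse_qs  # Python 2, as in the traceback

# Hypothetical base URL for illustration; the real one is not shown here.
BASE_URL = "https://example.com/search?q=widgets"

for i in range(1, 5):
    page_url = BASE_URL + "&page=" + str(i)
    # Parse the query string back out of the built URL
    query = parse_qs(urlsplit(page_url).query)
    # If this holds, the URL itself is well-formed; any stripping of
    # &page=X is happening on the browser side, not in string building.
    assert query["page"] == [str(i)]
```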

Stack trace for the timeout, in regard to the follow-up answer:

Traceback (most recent call last):
  File "/Users/x/PycharmProjects/proj/src/scraper3.py", line 335, in <module>
    sys.exit(main())
  File "/Users/x/PycharmProjects/proj/src/scraper3.py", line 309, in main
    parsed_site = get_next_page(DRIVER, page_url)
  File "/Users/x/PycharmProjects/proj/src/scraper3.py", line 267, in get_next_page
    DRIVER.get(url)
  File "/Users/x/PycharmProjects/proj/venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 324, in get
    self.execute(Command.GET, {'url': url})
  File "/Users/x/PycharmProjects/proj/venv/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/Users/x/PycharmProjects/proj/venv/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: timeout
  (Session info: chrome=64.0.3282.167)
  (Driver info: chromedriver=2.35.528157 (4429ca2590d6988c0745c24c8858745aaaec01ef),platform=Mac OS X 10.13.3 x86_64)

1 Answer


I believe the problem you are having is that the domain requires you to visit it and store some cached data (e.g. cookies) before you start traversing its pages, but you are opening a new driver every time you move to a page. Try this:

def get_page(DRIVER, url):
    DRIVER.get(url)
    time.sleep(2)
    data = DRIVER.page_source
    return BeautifulSoup(data, "html.parser")

DRIVER = webdriver.Chrome(chrome_options=chrome_options)
DRIVER.get(BASE_URL)
parsedList = []
for i in range(1, 5):
    page_url = BASE_URL + "&page=" + str(i)
    parsed_site = get_page(DRIVER, page_url)
    parsedList.append(parsed_site)
for source in parsedList: print(source)
DRIVER.quit()

EDIT:

After the initial issue, you started to experience a known issue with chromedriver 2.35 and Chrome build 64. The answer to that error is here; glad I could help.

PixelEinstein
  • So I originally tried using a global for the webdriver (hence why it's capital), but on the second DRIVER.get(url) call it always times out. As in, the page opens, but it never reaches the print statement on the line immediately following the second .get(url). But I tried your snippet, and it's giving me the same problem -- timing out on the second get request. – st2 tas Mar 02 '18 at 17:19
  • Can you post the **stack** you are receiving after the `timeout`? – PixelEinstein Mar 02 '18 at 17:23
  • Okay, I've been having a rough time with .get() in Chrome version 64. Can you read the answer to [this](https://stackoverflow.com/questions/48666620/python-selenium-webdriver-stuck-at-get-in-a-loop/48692986#48692986) and let me know if it helps you? – PixelEinstein Mar 02 '18 at 17:41
  • chrome option '--disable-browser-side-navigation' solved it and now it works perfectly. Thank you so much. – st2 tas Mar 02 '18 at 18:10
  • Great! I will edit my answer to include the thread that solved it, if you could please accept it as the answer. – PixelEinstein Mar 02 '18 at 18:12
  • @st2tas, do you still need help with this? – PixelEinstein Mar 07 '18 at 16:46
  • I actually do have another question now related to this. I'm running myprogram1.py, which calls webscrape.py, which creates a global web driver as described. I'd like to also run myprogram2.py, which also calls webscrape.py, but it just opens a chrome window but doesn't load anything (and eventually throws the error: selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally). Is there something I need to change to make it work? – st2 tas Mar 08 '18 at 14:05
  • I just used the Firefox driver instead for the second instance. So all is good again. – st2 tas Mar 08 '18 at 14:47