
I am applying a function that scrapes a URL using Selenium to each row of a pandas DataFrame. I am scraping many websites (on the order of 10⁴). After 50 or so websites are scraped successfully, I get an InvalidSessionIdException. I only close the driver explicitly at the end of my computation, so I am confused why I am getting this error.

Here is the code sample for reference:

This is the code that scrapes each individual website

from selenium.common.exceptions import WebDriverException

def scrape_all_text(url, keyword, wd):
  try:
    print(str(url))
    if str(url).startswith("http://") or str(url).startswith("https://"):
      wd.get(str(url))
    else:
      wd.get("http://" + str(url))
    text = wd.find_element_by_tag_name("body").text.replace('\n', ' ')
    print(f"KEYWORD: {keyword}, TEXT: {text}")
    return text
  except WebDriverException as e:
    print(f"KEYWORD: {keyword}, TEXT: {None}, EXCEPTION: {e}")
    return None
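The scheme-handling branch above just prepends `http://` when the URL has no scheme. As a standalone sketch that can be checked without a browser (the helper name `normalize_url` is mine, not from my code):

```python
def normalize_url(url):
    """Prepend http:// when the URL lacks a scheme, mirroring the branch above."""
    url = str(url)
    if url.startswith("http://") or url.startswith("https://"):
        return url
    return "http://" + url
```

For example, `normalize_url("example.com")` returns `"http://example.com"`, while URLs that already carry a scheme pass through unchanged.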

This is the generator that is supposed to scrape one subset of my websites at a time and yield each scraped chunk back to me.

import math
import numpy as np
from selenium import webdriver

def split_and_scrape(split_percent, df, col_to_add, scrape_func):
  num_splits = math.ceil(np.reciprocal(split_percent))
  entries_per_split = int(len(df.index) * split_percent)
  split_df_list = np.array_split(df, num_splits)
  for i, split in enumerate(split_df_list):
    # chrome_options is defined earlier in my Colab setup
    wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
    wd.set_page_load_timeout(20)
    print(f"Running on {entries_per_split*i}th - {entries_per_split*(i+1)}th entries")
    split[col_to_add] = split.apply(lambda x: scrape_func(x['guess_site_url'], x['keyword'], wd), axis=1)
    wd.close()
    yield split
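The splitting arithmetic itself can be verified without Selenium. A minimal sketch on a toy DataFrame (the column values here are illustrative, not my real data):

```python
import math
import numpy as np
import pandas as pd

split_percent = 0.25
df = pd.DataFrame({"guess_site_url": [f"site{i}.com" for i in range(10)]})

# ceil(1 / 0.25) == 4 chunks; array_split tolerates uneven division
num_splits = math.ceil(np.reciprocal(split_percent))
chunks = np.array_split(df, num_splits)
print([len(c) for c in chunks])  # → [3, 3, 2, 2]
```

`np.array_split` (unlike `np.split`) accepts a length that is not evenly divisible, which is why 10 rows split into 4 chunks yields sizes 3, 3, 2, 2.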

This is the error I run into after a while:

InvalidSessionIdException                 Traceback (most recent call last)
<ipython-input-90-55b98c96b157> in <module>()
      2 # wd.set_page_load_timeout(20)
      3 # merged['page_contents'] = merged.apply(lambda x: scrape_all_text(x['guess_site_url'], x['keyword'], wd), axis=1) #next put in function where merged saves every few entries
----> 4 for i, split in enumerate(split_and_scrape(0.001, merged, 'page_contents', scrape_all_text)):
      5   split.to_csv(f"page_contents_{i}.csv")

3 frames
/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    245                 alert_text = value['alert'].get('text')
    246             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 247         raise exception_class(message, screen, stacktrace)
    248 
    249     def _value_or_default(self, obj: Mapping[_KT, _VT], key: _KT, default: _VT) -> _VT:

InvalidSessionIdException: Message: invalid session id

For context, I am running this code in Google Colab, although I am not sure whether that is relevant to the error I'm getting.

  • does this help you? https://stackoverflow.com/q/56483403/14541164 – Utpal Kumar Jun 20 '22 at 14:11
  • @UtpalKumar the discussion about the WebDriver controlled WebBrowser being blocked might be relevant, but based on the error messages I'm getting, I'm not sure how/if I can confirm that this is the reason I am getting this exception. – Omar Dahleh Jun 20 '22 at 14:15

0 Answers