I am applying a function that scrapes a URL with Selenium to each row of a pandas DataFrame. I am scraping many websites (on the order of 10^4). After 50 or so websites are scraped successfully, I get an InvalidSessionIdException. I only close the driver explicitly at the end of my computation, so I am confused about why I am getting this error.
Here is the code sample for reference:
This is the code that scrapes each individual website:
def scrape_all_text(url, keyword, wd):
    try:
        print(str(url))
        if (str(url).startswith("http://") or str(url).startswith("https://")):
            wd.get(str(url))
        else:
            wd.get("http://" + str(url))
        text = wd.find_element_by_tag_name("body").text.replace('\n', ' ')
        print(f"KEYWORD: {keyword}, TEXT: {text}")
        return text
    except WebDriverException as e:
        print(f"KEYWORD: {keyword}, TEXT: {None}, EXCEPTION: {e}")
        return None
This is the generator that is supposed to scrape a subset of my websites and yield them to me.
def split_and_scrape(split_percent, df, col_to_add, scrape_func):
    num_splits = math.ceil(np.reciprocal(split_percent))
    entries_per_split = int(len(df.index) * split_percent)
    split_df_list = np.array_split(df, num_splits)
    for i, split in enumerate(split_df_list):
        wd = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
        wd.set_page_load_timeout(20)
        print(f"Running on {entries_per_split*i}th - {entries_per_split*(i+1)}th entries")
        split[col_to_add] = split.apply(lambda x: scrape_func(x['guess_site_url'], x['keyword'], wd), axis=1)
        wd.close()
        yield split
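To make the splitting arithmetic concrete, here is a minimal sketch of what I understand the generator to compute before any scraping happens. It uses only the standard library: `math.ceil(1 / split_percent)` stands in for the `np.reciprocal` call, and `array_split_sizes` is a hypothetical helper that reproduces the chunk sizes `np.array_split` is documented to return (the first `len % n` chunks get one extra row). The row count and fraction are made-up values, not my real data.

```python
import math


def array_split_sizes(n_rows, num_splits):
    """Hypothetical helper: chunk sizes np.array_split is documented to produce."""
    base, extra = divmod(n_rows, num_splits)
    # The first `extra` chunks receive one more row than the remaining ones.
    return [base + 1] * extra + [base] * (num_splits - extra)


n_rows = 10            # made-up frame length
split_percent = 0.25   # made-up fraction per split

num_splits = math.ceil(1 / split_percent)        # number of chunks requested
entries_per_split = int(n_rows * split_percent)  # value used only in the progress message
sizes = array_split_sizes(n_rows, num_splits)    # rows actually scraped per driver

print(num_splits, entries_per_split, sizes)  # 4 2 [3, 3, 2, 2]
```

So the "Running on ... entries" message can drift from the rows a given driver really handles, since `entries_per_split` is truncated while the chunks are not. That only affects the log output, as far as I can tell, not which rows get scraped.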
This is the error I run into after a while:
InvalidSessionIdException                 Traceback (most recent call last)
<ipython-input-90-55b98c96b157> in <module>()
      2 # wd.set_page_load_timeout(20)
      3 # merged['page_contents'] = merged.apply(lambda x: scrape_all_text(x['guess_site_url'], x['keyword'], wd), axis=1) #next put in function where merged saves every few entries
----> 4 for i, split in enumerate(split_and_scrape(0.001, merged, 'page_contents', scrape_all_text)):
      5     split.to_csv(f"page_contents_{i}.csv")

3 frames
/usr/local/lib/python3.7/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    245                 alert_text = value['alert'].get('text')
    246             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 247         raise exception_class(message, screen, stacktrace)
    248
    249     def _value_or_default(self, obj: Mapping[_KT, _VT], key: _KT, default: _VT) -> _VT:

InvalidSessionIdException: Message: invalid session id
For context, I am running this code in Google Colab, although I am not sure whether that is relevant to the error I'm getting.
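One thing I notice in the traceback is that the exception surfaces at the `for` loop rather than inside `scrape_all_text`. Since `scrape_all_text` catches `WebDriverException`, and `InvalidSessionIdException` is a subclass of it, my guess is that the uncaught error comes from a driver call in the generator body itself, which has no `try`/`except` around it. The sketch below reproduces that shape without Selenium. `FakeDriver`, `FakeSessionError`, `scrape`, and `scrape_in_chunks` are hypothetical stand-ins, and the point at which the fake session "dies" is an assumption for illustration only.

```python
class FakeSessionError(Exception):
    """Hypothetical stand-in for selenium's InvalidSessionIdException."""


class FakeDriver:
    """Hypothetical driver whose session becomes invalid after a few page loads."""

    def __init__(self, dies_after):
        self.calls = 0
        self.dies_after = dies_after

    def _check_session(self):
        if self.calls >= self.dies_after:
            raise FakeSessionError("invalid session id")

    def get(self, url):
        self._check_session()
        self.calls += 1
        return f"text of {url}"

    def close(self):
        self._check_session()


def scrape(url, wd):
    # Mirrors scrape_all_text: driver errors during the scrape are swallowed.
    try:
        return wd.get(url)
    except FakeSessionError:
        return None


def scrape_in_chunks(chunks, dies_after):
    for chunk in chunks:
        wd = FakeDriver(dies_after)
        results = [scrape(url, wd) for url in chunk]
        wd.close()  # unguarded: an error here propagates to the caller's for loop
        yield results


collected = []
escaped = None
try:
    for results in scrape_in_chunks([["a", "b", "c"]], dies_after=2):
        collected.append(results)
except FakeSessionError as exc:
    escaped = exc

print(collected, repr(escaped))
```

In this sketch the third scrape fails quietly and returns `None`, then the unguarded `close()` raises before the chunk is yielded, so the whole chunk is lost. If my real session is dying partway through a split for some reason, that would match what I am seeing, but I have not confirmed that this is what happens.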