I have a problem with the following code:

import re
from lxml import html
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import requests
import sys
import datetime

print('start!')
print(datetime.datetime.now())

list_file = 'list2.csv'
#This should be the regular input list

url_list = ["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"]
#This is an example input instead

binary = FirefoxBinary('C:/Program Files (x86)/Mozilla Firefox/firefox.exe')
# I read somewhere that supplying the binary explicitly could be useful, but the
# program still fails at random with [WinError 6] Invalid Descriptor, even on runs
# that look identical to the ones where it at least manages to get the page.

for page in url_list:
    print(page)
    browser = webdriver.Firefox(firefox_binary=binary)
    # I tried this too, to work around the [WinError 6], but it does not help
    browser.get(page)
    print("TEST BEGINS")
    soup = BS(browser.page_source, "lxml")
    soup = soup.find("summaries")
    # This fails here: it finds nothing, although there is a section with the
    # id "summaries". soup.find_all("p") works, but I don't want all the p's
    # outside of summaries.
    print(soup)  # It prints "None" indeed.
    print("TEST ENDS")

I am positive the page source includes "summaries". First there is

 <li> <a href="#summaries" ng-click="scrollTo('summaries')">Summaries</a></li>

then there is

 <section id="summaries" data-ga-label="Summaries" data-section="Summaries">

As @alexce suggested here (Webscraping in python: BS, selenium, and None error), I tried

 summary = soup.find('section', attrs={'id':'summaries'})

(Edit: the suggestion was _summaries, but I tested summaries too.)

but it does not work either. So my questions are: why does BS not find the summaries section, and why does Selenium keep breaking when I run the script many times in a row (restarting the console works, but that is tedious) or with a list of more than four entries? Thanks

Ando Jurai
  • I tested many solutions presented [here](http://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id) and none of them work, so I guess it has to do with my specific page... I also tried things other than Selenium (RoboBrowser, MechanicalSoup), but the packages are not available under Windows... – Ando Jurai Mar 16 '16 at 14:41

2 Answers


This:

summary = soup.find('section', attrs={'id':'_summaries'})

searches for a section element whose id attribute is set to _summaries:

 <section id="_summaries" />

There is no element with that id in the page.
The one you want is probably <section id="summaries" data-ga-label="Summaries" data-section="Summaries">, and it can be matched with:

 results = soup.find('section', id='summaries')

A side note on why you are using Selenium: the page returns an error if you do not forward cookies, so in order to use requests you need to send them yourself.

My full code:

from __future__ import unicode_literals

import re
import requests
from bs4 import BeautifulSoup as BS


data = requests.get(
    'http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3',
    cookies={
        'nlbi_146342': '+fhjaf6NSntlOWmvFHlFeAAAAAAwHqv5tJUsy3kqgNQOt77C',
        'visid_incap_146342': 'tEumui9aQoue4yMuu9tuUcly6VYAAAAAQUIPAAAAAABcQsCGxBC1gj0OdNFoMEx+',
        'incap_ses_189_146342': 'bNY8PNPZJzroIFLs6nefAspy6VYAAAAAYlWrxz2UrYFlrqgcQY9AuQ=='
    }).content

soup = BS(data, "lxml")

# Sanity check: list every string in the page that mentions "summary"
results = soup.find_all(string=re.compile('summary', re.I))
print(results)

# The section itself
results = soup.find('section', id='summaries')
print(results)
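The cookie values above are the ones my browser happened to receive, and they expire. A sketch of a more robust variant is to let a requests.Session collect whatever cookies the first response sets (assuming the anti-bot cookies are set via HTTP headers; if they are set by JavaScript, this will not be enough and Selenium remains necessary):

import requests
from bs4 import BeautifulSoup as BS

url = 'http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3'

session = requests.Session()      # stores cookies across requests
session.get(url)                  # first request: receive the cookies
data = session.get(url).content   # second request: cookies are sent back automatically

soup = BS(data, "lxml")
print(soup.find('section', id='summaries'))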
Cyrbil
  • Sorry, I wrote it as the earlier suggestion said, but I had already tried summary = soup.find('section', attrs={'id':'summaries'}). Thanks for your suggestion, but why can id be passed directly as a keyword argument, and why isn't it treated like the other attributes (if we consider using attrs={'id':'summaries'})? There is a subtlety that I am missing, but I guess it is because I mostly know rudimentary HTML; I know how to read it, but I am not aware of its "grammar". So maybe I confused an attribute with something else. – Ando Jurai Mar 16 '16 at 15:59
  • Regarding cookies: I feel I can't use requests, since the page is protected against bots and blocks my scraper (I am only doing something I would do by hand, just copy-pasting things at human speed, but with me actually working at the same time). Also, are these cookies standard, or where should I find a list of them? As I said, I know a bit of many things, but this largely surpasses my skill level and knowledge. – Ando Jurai Mar 16 '16 at 16:03
  • `id` is not actually a Python keyword, so you can pass it directly as `id='summaries'`. The trailing underscore is only needed for names that really are reserved words, such as `class`, which BeautifulSoup accepts as `class_` (see the short sketch after these comments). The second option is to write `attrs={'id':'summaries'}`, where `'id'` is just a string and works for any attribute name. To bypass the cookies, you need to get the page once to receive the cookies, then resend the request with them. That is basically what your browser does, and therefore what Firefox does under Selenium. – Cyrbil Mar 16 '16 at 16:20
  • Oh, OK, I had no idea about that use of the underscore in Python. I could not even begin to suspect it was part of the syntax; it is not something I learnt from basic tutorials. For completeness and for future users, see https://www.python.org/dev/peps/pep-0008/#descriptive-naming-styles and http://stackoverflow.com/questions/16095188/is-there-a-python-naming-convention-for-avoiding-conflicts-with-standard-module. Thanks again. – Ando Jurai Mar 17 '16 at 08:18
  • To elaborate further on the solution: soup.find('section', attrs={'id':'summaries'}) works like a charm, and so does soup.find('section', id='summaries'). I had also tested soup.summaries as seen in the BS docs, but that did not work either. – Ando Jurai Mar 17 '16 at 08:43
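A minimal standalone sketch of the keyword-argument vs. attrs distinction discussed in the comments above (the HTML string here is made up for illustration):

from bs4 import BeautifulSoup

html_doc = '<section id="summaries" class="card"><p>text</p></section>'
soup = BeautifulSoup(html_doc, 'html.parser')

# 'class' is a reserved word in Python, so BeautifulSoup expects 'class_'
print(soup.find('section', class_='card'))

# 'id' is not reserved, so it can be passed directly as a keyword argument...
print(soup.find('section', id='summaries'))

# ...or through attrs, which works for any attribute name
print(soup.find('section', attrs={'id': 'summaries'}))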

The element is probably not yet present in the page when you read page_source. I would wait for it before parsing with BS:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")

# Wait up to 10 seconds for the summaries section to become visible
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "summaries")))

soup = BS(driver.page_source, "lxml")

I noticed that you never call driver.quit(); this may be the reason for your crashes. Make sure to call it, or try to reuse the same session.
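For instance, a minimal pattern (reusing url_list from the question) that guarantees the browser is released even if the scraping raises:

from selenium import webdriver

browser = webdriver.Firefox()
try:
    for page in url_list:
        browser.get(page)
        # ... scrape the page here ...
finally:
    browser.quit()  # always terminate the browser process, even on errors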

And to make it more stable and performant, I would work with the Selenium API as much as possible, since pulling and parsing the whole page source is expensive.
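For example, the summaries section can be read directly through Selenium without re-parsing the whole page (a sketch continuing from the waiting snippet above; the id and tag names are taken from the page quoted in the question):

# The wait above guarantees the element is present by now
section = driver.find_element(By.ID, "summaries")
print(section.text)  # all visible text of the section

# Only the paragraphs inside the section, not the p's elsewhere on the page
for p in section.find_elements(By.TAG_NAME, "p"):
    print(p.text)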

Florent B.
  • Thanks. I tried to wait with time or WebDriverWait, and had a driver.quit() in my loop, but discarded it as it didn't correct the problem. I also tried driver.close(), but it breaks as well. I know this is expensive, but I was only trying to reproduce what I had seen before. Actually, I could not make the driver.find_element_by_id things from Selenium work either, so I tried the most widely used approach on the internet (I am not good enough to pretend to reinvent the wheel; I can't even make a round one roll...). – Ando Jurai Mar 16 '16 at 16:54