
I want to use Python web scraping to feed an ML application I wrote that makes a summary of summaries, to ease my daily research work. I have run into difficulties: despite following many suggestions from the web, such as this one:
Python Selenium accessing HTML source, I keep getting AttributeError: 'NoneType' object has no attribute 'page_source' (or 'content', depending on the attempt and the modules used). I need this source to feed BeautifulSoup, which scrapes the page and feeds my ML script. My first attempt was to use requests:

from bs4 import BeautifulSoup as BS
import requests
import time
import datetime
print ('start!')
print(datetime.datetime.now())

page="http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"

This is my target page. I usually do about 20 requests a day, so it's not like I want to vampirize the website, and since I need them all at the same time, I wanted to automate the retrieval task; the longest part is getting the URL, loading it, and copying and pasting the summaries. I am also being reasonable, since I respect some delay before loading another page. I tried passing as a regular browser, since the site doesn't like robots (its robots.txt disallows /ProductRedirect and something with a number that I could not find on Google).
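As a side note, the standard library can check a site's robots.txt directly. A minimal sketch (I am only assuming robots.txt sits at the usual location; the actual rules are whatever the site serves):

import urllib.robotparser

# Parse the site's robots.txt from its standard location (assumed)
rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://www.genecards.org/robots.txt")
rp.read()

# Ask whether a generic crawler may fetch the target gene page
url = "http://www.genecards.org/cgi-bin/carddisp.pl?gene=COL1A1&keywords=COL1A1"
print(rp.can_fetch("*", url))

Anyway, here is how I tried to pass as a regular browser: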

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0'}
current_page = requests.get(page,  headers=headers)
print(current_page)
print(current_page.content)
soup=BS(current_page.content,"lxml")
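To see what actually comes back, here is a quick debugging sketch I can run (requests exposes the status code, the response headers, and the raw body):

# Sanity checks on the response: a 200 status does not guarantee a useful body
print(current_page.status_code)
print(current_page.headers.get('Content-Type'))
print(len(current_page.content))  # number of bytes actually received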

I always end up getting no content, while the request returns code 200 and I can load the page myself in Firefox. So I tried with Selenium:

from bs4 import BeautifulSoup as BS
from selenium import webdriver
import time
import datetime
print ('start!')
print(datetime.datetime.now())

browser = webdriver.Firefox()
current_page = browser.get(page)
time.sleep(10)

This works and loads a page. I added the delay to be sure not to spam the host and to let the page load fully. Then neither:

html=current_page.content

nor

html=current_page.page_source

nor

html=current_page

works as an input for:

soup=BS(html,"lxml")

It always ends up saying that it doesn't have the page_source attribute (while it should, since the page loads correctly in the Selenium-invoked browser window).

I don't know what to try next. It's as if the User-Agent header were not working for requests, and it is very strange that the page returned by Selenium has no source.

What could I try next? Thanks.

Note that I also tried:

browser.get(page)
time.sleep(8)
print(browser)
print(browser.page_source)
html=browser.page_source
soup=BS(html,"lxml")
for summary in soup.find('section', attrs={'id':'_summaries'}):
    print(summary)

but while it can get the source, it just fails at the BS stage with: "AttributeError: 'NoneType' object has no attribute 'find'".

  • Have you tried the regular html parser? `soup = BS(html, "html.parser")` – gariepy Mar 03 '16 at 16:21
  • I just did. I use lxml because they recommend it. Anyway, html.parser still gets "'NoneType' object has no attribute 'find'". I am trying new things from the last solution, which is able to print the source, but I don't get why BS still doesn't want to parse it once the robot check seems to be passed... – Ando Jurai Mar 03 '16 at 16:37

2 Answers


The problem is that you are trying to iterate over the result of .find(). Instead you need .find_all():

for summary in soup.find_all('section', attrs={'id':'_summaries'}):
    print(summary)

Or, if there is a single element, don't use a loop:

summary = soup.find('section', attrs={'id':'_summaries'})
print(summary)
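Since an id should be unique in the document, the single-element form is usually what you want, and the returned Tag supports the same search API as the soup itself. A minimal sketch of the nested search (assuming, as in your code, the page has a <section id="_summaries"> containing <p> elements):

summary = soup.find('section', attrs={'id': '_summaries'})
if summary is not None:
    # A Tag can be searched just like the soup object itself
    paragraphs = [p.get_text(strip=True) for p in summary.find_all('p')]
    print('\n'.join(paragraphs))
else:
    print('No section with id="_summaries" in this page source')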
– alecxe
  • Ok, it seems better, indeed. Thanks. It is not quite the original question, but I tried to iterate over the find_all iterable because I need to get the section with this "_summaries" id, then scrape it again for its parts (between the <p> tags that fill the document and are all over the place). Do you have a suggestion for that? Are nested BS objects good practice, or can I achieve that in one command? – Ando Jurai Mar 03 '16 at 17:08
    @AndoJurai sure, that `summary` variable inside a loop is a BS `Tag` instance - you can search inside it as with a regular soup object: `[p.get_text() for p in summary.find_all("p")]` for instance. Hope that helps. – alecxe Mar 03 '16 at 17:10
  • Thanks. Actually soup.find_all("p") works (while getting some text that I don't want), but neither summary.find_all("p") nor soup.summary.find_all("p") does, as the section I am interested in is `<section id="_summaries">`; I also tried _summaries, summaries, and Summaries as an identifier, but all of these come back with this attribute error. Using `for a in soup.find_all(re.compile("Summ"))` gets nothing, while `for a in soup.find_all(re.compile("section"))` gets too many things. I can't really wrap my head around how BS works... – Ando Jurai Mar 04 '16 at 09:46
  • @AndoJurai could you please elaborate on that in a separate question, providing the current code you have and the HTML source of the page, and point out what problems you are experiencing? Thanks! – alecxe Mar 04 '16 at 14:57
  • Yes, I am going to do that, it will be the best way. Thanks – Ando Jurai Mar 04 '16 at 16:53

You shouldn't have to convert the HTML to a string object. Note that browser.get(page) returns None (navigation happens as a side effect), so the source has to be read from the driver itself.

Try:

html = browser.page_source
soup = BS(html,"lxml")
– Brandon
  • Yes, actually this works; the str thing was one of my tries. I still don't really understand why you can't assign browser.get(page) to an object and then ask it for the page_source; for me it's puzzling. I'm really not familiar with this kind of object management; it's different from using constructors and so on. – Ando Jurai Mar 04 '16 at 08:45