
Hello, I'm having trouble scraping data from a website (fantsylabs dotcom) for modeling purposes. I'm just a hack, so forgive my ignorance of comp-sci lingo. What I'm trying to accomplish is...

  1. Use Selenium to log in to the website and navigate to the page with the data.

    ## Imports needed for this snippet
    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup

    ## Initialize and load the web page
    url = "website url"
    driver = webdriver.Firefox()
    driver.get(url)
    time.sleep(3)

    ## Fill out the forms and log in to the site
    username = driver.find_element_by_name('input')
    password = driver.find_element_by_name('password')
    username.send_keys('username')
    password.send_keys('password')
    login_attempt = driver.find_element_by_class_name("pull-right")
    login_attempt.click()

    ## Find and open the page with the data that I wish to scrape
    link = driver.find_element_by_partial_link_text('Player Models')
    link.click()
    time.sleep(10)

    ## UPDATED CODE TO TRY AND SCROLL DOWN TO LOAD ALL THE DYNAMIC DATA
    ## (this scrolls the grid container into view in the window, but does
    ## not scroll within the container itself)
    scroll = driver.find_element_by_class_name("ag-body-viewport")
    driver.execute_script("arguments[0].scrollIntoView();", scroll)

    ## Allow time for the full page to load the lazy way, then pass to BeautifulSoup
    time.sleep(10)
    html2 = driver.page_source

    soup = BeautifulSoup(html2, "lxml", from_encoding="utf-8")
    divs = soup.find_all('div', {'class': 'ag-pinned-cols-container'})
    ## continue to scrape what I want
    

This process works: it logs in, navigates to the correct page, and, once the page finishes dynamically loading (~30 seconds), passes the source to BeautifulSoup. I can see 300+ instances in the table that I want to scrape, yet the bs4 scraper only spits out about 30 of the 300. From my own research, it seems this could be an issue with the data loading dynamically via JavaScript, so that only what has been pushed into the HTML gets parsed by bs4? (Using Python requests.get to parse html code that does not load at once)

It may be hard for anyone offering advice to reproduce my example without creating a profile on the website, but would using PhantomJS to initialize the browser be all that is needed to "grab" all instances and capture all the desired data?

    driver = webdriver.PhantomJS()  ## instead of webdriver.Firefox()

Any thoughts or experiences will be appreciated, as I've never had to deal with dynamic pages/scraping JavaScript, if that is what I am running into.


UPDATED AFTER alecxe's response:

Below is a screenshot of the targeted data (highlighted in blue). You can see the scroll bar on the right of the image, embedded within the page. I have also provided a view of the page source at this container.

[Screenshot: targeted rows highlighted in blue inside a grid container with its own scroll bar, alongside the page source for that container]

I have modified the original code I provided to attempt to scroll down to the bottom and fully load the page, but it fails to perform this action. When I set the driver to Firefox(), I can see the page move down via the outer scroll bar, but not within the targeted container. I hope this makes sense.
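For reference, a minimal sketch of scrolling the container itself rather than the window, by setting its scrollTop directly. This assumes the inner viewport keeps the ag-body-viewport class used above, and it continues from the logged-in driver session:

    import time

    ## Drive the container's own scroll bar (not the window's) by setting
    ## its scrollTop directly; 'driver' is the logged-in session above.
    viewport = driver.find_element_by_class_name("ag-body-viewport")
    driver.execute_script(
        "arguments[0].scrollTop = arguments[0].scrollHeight;", viewport)
    time.sleep(2)  ## crude wait for the grid to render the newly exposed rows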

Thanks again for any advice/guidance.


1 Answer


It's not easy to answer since there is no way for us to reproduce the problem.

One problem is that lxml is not handling this specific HTML particularly well, and you may need to try changing the parser:

    soup = BeautifulSoup(html2, "html.parser")
    soup = BeautifulSoup(html2, "html5lib")

Also, there might not be a need for BeautifulSoup in the first place. You can locate elements with Selenium in a lot of different ways. For example, in this case:

    for div in driver.find_elements_by_css_selector(".ag-pinned-cols-container"):
        # do something with 'div'

It may also be that the data is dynamically loaded when you scroll the page to the bottom. In this case, you may need to keep scrolling to the bottom until you see the desired amount of data or until no more new data loads on scroll. There are relevant threads with sample solutions for this pattern.
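A rough sketch of that scroll-until-nothing-new-loads idea, assuming the rows live in the ag-body-viewport container named in the question; the pause and round count are guesses to tune:

    import time

    def scroll_until_loaded(driver, pause=2, max_rounds=30):
        ## Repeatedly scroll the inner grid container to its bottom until
        ## its scrollable height stops growing (nothing new loaded).
        viewport = driver.find_element_by_class_name("ag-body-viewport")
        last_height = -1
        for _ in range(max_rounds):
            driver.execute_script(
                "arguments[0].scrollTop = arguments[0].scrollHeight;", viewport)
            time.sleep(pause)  ## give the grid time to fetch and render rows
            height = driver.execute_script(
                "return arguments[0].scrollHeight;", viewport)
            if height == last_height:
                break
            last_height = height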

  • Thank you for your input, alecxe; you have pointed me in the right direction with scrolling the page to the bottom. I haven't run into this issue before. The links help, but I still can't seem to get it right. It appears that the targeted data is in a separate container within the webpage with its own separate scroll bar. I have edited my initial question with a screenshot and some updated code that still needs work. – boothtp Jan 14 '16 at 13:19
  • @boothtp Good. I think [this answer](http://stackoverflow.com/a/30942319/771848) should be the most relevant: the idea would be to scroll the last row of the table into view to trigger the dynamic load. You just need to fix the locators. This is still a guess, though. Hope it helps. – alecxe Jan 14 '16 at 16:17
  • Thanks again. I was able to work with your suggestions for a few hours today. I still am not able to target the scroll bar in the image above, so any other guidance would be helpful... I can't inspect it either... so do I target the container? In addition, I found that when I manually scroll down, the data dynamically updates, but it only shows about 40 instances at a time... For example, if I load the page I see instances 1-40; if I scroll down further I'll see, say, instances 20-60, and 1-20 disappear from the source code... How would you go about capturing the data in this case? – boothtp Jan 15 '16 at 03:58
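For the virtualized behaviour described in the last comment (rows 1-20 dropping out of the DOM as 20-60 render), one approach is to scroll the container in small steps and accumulate each pass's rows, deduplicating as you go. A hedged sketch, assuming the grid rows carry an ag-row class inside the ag-pinned-cols-container from the question, and that a row's text is unique enough to key on:

    import time

    def harvest_rows(driver, step=500, pause=1, max_rounds=200):
        ## Only ~40 rows exist in the DOM at once, so collect whatever slice
        ## is rendered, advance the inner container a little, and repeat.
        viewport = driver.find_element_by_class_name("ag-body-viewport")
        seen = {}

        def collect():
            ## grab the currently rendered rows; keyed by text to dedupe
            for row in driver.find_elements_by_css_selector(
                    ".ag-pinned-cols-container .ag-row"):
                seen[row.text] = row.text

        for _ in range(max_rounds):
            collect()
            ## advance the container (not the window) and report whether
            ## we have reached its bottom
            at_end = driver.execute_script(
                "var v = arguments[0];"
                "v.scrollTop = v.scrollTop + arguments[1];"
                "return v.scrollTop + v.clientHeight >= v.scrollHeight;",
                viewport, step)
            time.sleep(pause)  ## let the grid render the next slice
            if at_end:
                collect()  ## pick up the final slice at the bottom
                break
        return list(seen.values())

The step size and pauses are guesses; too large a step can push rows out of the render window before they are collected.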