
I have written a simple web scraping code using Selenium but I want to scrape only the portion that is present 'before scroll'

Say, if it is this page I want to scrape - https://en.wikipedia.org/wiki/Pandas_(software) - Selenium reads information till the absolute last element/text which for me is the 'Powered by Media Wiki' button on the far bottom-right of the page.

[Screenshot: the page scrolled to the bottom, showing the 'Powered by MediaWiki' button]

What I want Selenium to do is stop after DataFrames (see screenshot) and not scroll down to the bottom.

[Screenshot: the page as it opens, before any scrolling]

And I also want to know where on the page it stops. I have checked multiple sources, but most of them deal with infinite-scroll websites; no one asks about scraping just the 'visible' half of a page.

This is my code now:

from selenium import webdriver

EXECUTABLE = r"chromedriver.exe"

# get the URL
url = "https://en.wikipedia.org/wiki/Pandas_(software)"

# open the chromedriver
driver = webdriver.Chrome(executable_path = EXECUTABLE)

# the browser window is maximized so that all webpages are rendered at the same size
driver.maximize_window()

# make the driver wait for 30 seconds before throwing a time-out exception
driver.implicitly_wait(30)

# get URL
driver.get(url)

for element in driver.find_elements_by_xpath("//*"):
    try:
        pass  # scraping logic goes here
    except Exception:
        continue

driver.close()

Absolutely any direction is appreciated. I have tried to be as clear as possible here but let me know if any more details are required.

Nilima

1 Answer


I don't think that is possible. If you observe the DOM, all the informational elements are under one tag, div[@id='content'], which is already visible to Selenium. Even if you query with //*, div[@id='content'] is visible.

And checking whether an element is visible, even though the page has not been scrolled to it, will also return True. (If someone knows how to do what you are asking for, I would like to know as well.)

from selenium import webdriver
# _element_if_visible is a private helper that wraps element.is_displayed()
from selenium.webdriver.support.expected_conditions import _element_if_visible

driver = webdriver.Chrome(executable_path='path to chromedriver.exe')
driver.maximize_window()
driver.implicitly_wait(30)

driver.get("https://en.wikipedia.org/wiki/Pandas_(software)")

# every element under div[@id='content'] reports as visible,
# whether or not it lies within the unscrolled viewport
elements = driver.find_elements_by_xpath("//div[@id='content']//*")
for element in elements:
    try:
        if _element_if_visible(element):
            print(element.get_attribute("innerText"))
    except Exception:
        break

driver.quit()
pmadhu
  • Yes. This doesn't work. And I haven't been able to find a solution in any of the documentation or here. I figured someone must have worked it out, hence I asked here. It seems like this should be possible, since so many online marketing materials - banners, game elements, shop buttons and whatnot - play with the visible section of the screen. – Nilima Sep 16 '21 at 17:49
  • Selenium interacts with the HTML DOM through many HTTP requests. Your post heading `Scraping only the portion that loads - Without Scrolling` is misleading; it is an enhancement in Selenium that it scrolls down to expand its viewport, and we are glad to have this feature. If you want to scroll to a particular web element within Selenium's viewport, you should ideally use ActionChains, and in extreme cases where JavaScript is required you should use `scrollIntoView` @Nilima. Also, the page that you have shared does not have a long vertical scroll bar. – cruisepandey Sep 16 '21 at 18:22
  • @cruisepandey I do not want it to scroll - it doesn't matter whether a page has a long vertical scroll bar or not. As long as it has an active scroll bar, I do not want my program to capture things that are 'out of view' upon load. I do not want to scroll to a particular web element for a scrape either. To be clearer, the viewport of my current screen is approximately 1920x1200, and I just want my program to scrape all data that is loaded in this 1920x1200 without any movement. Also, out of curiosity - why did you say my heading is misleading? – Nilima Sep 17 '21 at 05:09
  • @Nilima - `"visible section of the screen"` What we see on the screen is different from `Visibility` in Selenium. When you do `driver.get("url")`, `driver.page_source` has everything starting from `<html>` to `</html>`. – pmadhu Sep 17 '21 at 05:27
  • @pmadhu I understand that - that is my exact issue. Which is why I am asking if there is a function that can mimic our IRL webpage 'visibility'. – Nilima Sep 17 '21 at 06:05
  • @Nilima : The viewport is a different entity in Selenium; what we see with the naked eye is different. Selenium natively talks to the `HTML DOM`. I am not sure why one would have this use case. For scraping, locators are required; maybe you need to tweak them instead of wondering why Selenium sees it all - it will have a different viewport. It does not see the UI; rather, it is based on REST requests made by WebDriver. – cruisepandey Sep 17 '21 at 11:50
  • @cruisepandey Yeah. As I feared, the functionality doesn't exist. As to why anyone would have this use case - it is an extremely important one in marketing studies, where people 'walk out' of a page without scrolling down to see all of it; so, as a website creator, all your attention-capturing material should be in the 'visible' area. As of now, any study around this is done using very ineffective A/B testing, where every digital giant moves their elements around the page to see what works well. – Nilima Sep 17 '21 at 13:06
  • @Nilima : The functionality does not exist because nobody has really needed it. Again, scraping is based on the locator you choose; whatever locator you feed to Selenium, it will look for matching nodes in the HTML DOM, return them if found, and raise an exception if not. – cruisepandey Sep 20 '21 at 07:05
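The thread above keeps circling the gap between Selenium's `is_displayed()` (which checks CSS visibility, not scroll position) and what the naked eye sees. One possible workaround, not given in the answer itself, is to compare each element's `getBoundingClientRect()` (fetched via `execute_script`) against the viewport size before any scrolling has happened. A minimal sketch - the helper name and the XPath are illustrative assumptions, and the Selenium part is shown as commented usage since it requires a live browser:

```python
def in_initial_viewport(rect, viewport_width, viewport_height):
    """Return True if rect ({'x','y','width','height'}, client coordinates
    of an unscrolled page) lies at least partially inside the viewport."""
    return (rect["y"] < viewport_height and rect["y"] + rect["height"] > 0
            and rect["x"] < viewport_width and rect["x"] + rect["width"] > 0)

# Hypothetical usage with Selenium (requires a running browser):
# vw = driver.execute_script("return window.innerWidth")
# vh = driver.execute_script("return window.innerHeight")
# for element in driver.find_elements_by_xpath("//div[@id='content']//*"):
#     rect = driver.execute_script(
#         "var r = arguments[0].getBoundingClientRect();"
#         "return {x: r.x, y: r.y, width: r.width, height: r.height};",
#         element)
#     if in_initial_viewport(rect, vw, vh):
#         print(element.get_attribute("innerText"))
```

This only approximates what the OP wants (fixed headers, transforms, and late layout shifts can move elements after the measurement), but it does distinguish above-the-fold elements from those that would need a scroll.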