0

I have a Selenium app that does some work and then saves a screenshot of an image element as a PNG:

img = driver.find_element_by_xpath('//div[@id="qrcode"]/img')

with open('image.png', 'wb') as f:
    f.write(img.screenshot_as_png)

and then it will be texted to me.

This was all working fine until I introduced headless:

chrome_options = Options()
chrome_options.add_argument('--headless')

driver = webdriver.Chrome(executable_path="C:/chromedriver.exe", options=chrome_options)

Now, for some reason, it saves only the top half of the image. I took the headless argument away, and it works perfectly. Any suggestions?

demouser123
  • 4,108
  • 9
  • 50
  • 82
Fishy
  • 76
  • 7
  • This question has a highly upvoted answer here - https://stackoverflow.com/questions/51653344/taking-screenshot-of-whole-page-with-python-selenium-and-firefox-or-chrome-headl . You should check this out. – demouser123 Sep 12 '21 at 02:02
  • Please add the website information. – PDHide Sep 12 '21 at 09:05

3 Answers

2

It is not the screenshotting that is causing the problem. The default resolution of a headless browser is simply different from that of a normal browser.

You can manually adjust your Selenium driver's window size with the argument below. Try changing the size and see if you can capture the full image.

chrome_options.add_argument("window-size=1400,600")

You may also try (though in headless mode an explicit window size is usually more reliable than maximizing):

chrome_options.add_argument("--start-maximized")
Hammad
  • 529
  • 7
  • 17
1

While @Hammad's solution seems a reasonable fix, if that does not work and you are interested in the requests module, you can try the code below to download the image instead.

import requests

path = 'target.jpg'
response = requests.get("Image SRC/URL here", stream=True)

if response.status_code == 200:
    with open(path, 'wb') as file:
        # Stream the image to disk in chunks rather than loading it all at once
        for chunk in response.iter_content(chunk_size=1024):
            file.write(chunk)

Also, in place of the URL, you can pass the image's src attribute:

img = driver.find_element_by_xpath('//div[@id="qrcode"]/img').get_attribute('src')
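
If the image is only reachable inside the browser's logged-in session, the plain requests call above will not be authenticated. One workaround, sketched here under the assumption that driver is the already-logged-in Selenium instance from the question, is to copy the browser's cookies into a requests.Session:

import requests

session = requests.Session()
# Carry the browser's session cookies over so the download is authenticated
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

src = driver.find_element_by_xpath('//div[@id="qrcode"]/img').get_attribute('src')
response = session.get(src, stream=True)

if response.status_code == 200:
    with open('target.jpg', 'wb') as file:
        for chunk in response.iter_content(chunk_size=1024):
            file.write(chunk)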

cruisepandey
  • 28,520
  • 6
  • 20
  • 38
  • The problem with this method is that I have to first go through a login process before I get to the page with the QR code. The image will change daily, so I cannot use a previous set link. – Fishy Sep 13 '21 at 22:34
0

Some websites have anti-scraping mechanisms that involve detecting the webdriver property of the browser. When Chrome runs under automation (including headless mode), this property is set by the browser, which indicates to websites that the origin of the request is a bot or a program.

You can try to execute JavaScript that overrides the webdriver property of your browser in headless mode.
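
For example, a minimal Selenium sketch (execute_cdp_cmd is available for Chromium-based drivers) that injects the override before any page script runs:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(executable_path="C:/chromedriver.exe", options=chrome_options)

# Page.addScriptToEvaluateOnNewDocument runs this snippet before any page
# script on every navigation, masking the automation flag
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
})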

However, please also note that this is just one of many mechanisms used by websites to detect bots or programs.

You may also check this answer

Here is some sample code I wrote using the pyppeteer library.

import asyncio
import os
import re

from pyppeteer import launch
# from pyvirtualdisplay import Display
from argparse import ArgumentParser


class HTMLRetriever(object):
    _page_source = None
    _title = None

    def __init__(self, url):
        self.url = url

    async def load(self):
        # with Display(backend='xvfb') as disp:
        await self._init_browser()
        await self._init_webpage()
        # Apply the property overrides before navigating, otherwise they never run
        await self._init_webpage_properties()
        await self._connect_website()
        await self._take_snapshot()

    @classmethod
    async def _init_display(cls):
        # Only needed when running under a virtual X display; requires
        # pyvirtualdisplay (import commented out above) and is unused by load()
        cls.disp = Display(backend='xvfb')

    @classmethod
    async def _init_browser(cls):
        cls.browser = await launch(headless=True, args=[
            "--no-sandbox",
            "--single-process",
            "--disable-dev-shm-usage",
            "--no-zygote",
            # No embedded quotes here: args are passed straight to the process,
            # so literal quotes would end up inside the User-Agent header
            "--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
        ])

    @classmethod
    async def _init_webpage(cls):
        cls.page = await cls.browser.newPage()
        await asyncio.sleep(1)
        await cls.page.setJavaScriptEnabled(True)

    @classmethod
    async def _init_webpage_properties(cls):
        # evaluateOnNewDocument registers the script to run before any page
        # script on every navigation; a plain evaluate() would only patch the
        # current (blank) document and be lost after page.goto
        await cls.page.evaluateOnNewDocument('''() => {
            Object.defineProperties(navigator, {
                webdriver: {
                    get: () => false
                }
            })
        }''')

        await cls.page.evaluateOnNewDocument('''() => {
            Object.defineProperties(window, {
                chrome: {
                    get: () => true
                }
            })
        }''')

    async def _connect_website(self):
        await self.page.goto(self.url, {'waitUntil': 'networkidle2', 'timeout': 60000})
        await asyncio.sleep(6)
        self._title = await self.page.evaluate('''() => {
            return document.title
        }''')

        self._page_source = await self.page.content()

    async def _take_snapshot(self):
        # str.strip() removes a set of characters, not a prefix, so drop the
        # scheme with a regex before building a filesystem-safe file name
        name = re.sub(r'^https?://', '', self.url).replace('.', '_').replace('/', '-')
        os.makedirs('snapshots', exist_ok=True)
        await self.page.screenshot({'path': f"snapshots/{name}.png"})

    @property
    def page_source(self):
        return self._page_source

    @property
    def title(self):
        return self._title

    async def close(self):
        await self.browser.close()


async def main():
    parser = ArgumentParser(description='A tool to obtain the HTML of a web URL')
    parser.add_argument('-u', '--url', dest='url', type=str, required=True, metavar='URL',
                        help='URL of the website for which HTML is to be retrieved')
    args = parser.parse_args()
    kwargs = vars(args)
    if kwargs.get('url') is not None:
        retriever = HTMLRetriever(url=kwargs.get('url'))
        await retriever.load()
        print(retriever.title)
        await retriever.close()

if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(main())
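
Assuming the script is saved as html_retriever.py (a file name chosen here purely for illustration), it can be run as:

python html_retriever.py -u https://example.com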
  • I doubt that the website I am taking info from has anti-scraping measures, for 2 reasons: 1. The website's ToS says nothing about scraping. 2. I have done similar things on a website run by the same domain without any problem. – Fishy Sep 11 '21 at 23:34