2

I'm trying to create a script using Python in combination with seleniumwire, implementing proxies within it. The script sometimes works correctly, but most of the time it prints error log details even when the status_code is 200. I wish to get rid of those log details. The IP address hardcoded in the script is taken from a free proxy site, so it might no longer work.

This is what I'm trying with:

from seleniumwire import webdriver

URL = 'https://www.zillow.com/Houston,-TX/houses/'

options = {
    'mitm_http2': False,
    'proxy': {'https': f'https://136.226.33.115:80'}
}

driver = webdriver.Chrome(seleniumwire_options=options)

driver.get(URL)
assert driver.requests[0].response.status_code==200
post_links = [i.get_attribute("href") for i in driver.find_elements_by_css_selector("article[role='presentation'] > .list-card-info > a.list-card-link")]
for individual_link in post_links:
    driver.get(individual_link)
    assert driver.requests[0].response.status_code==200
    post_title = driver.find_element_by_css_selector("h1").text
    print(post_title)
driver.quit()

This is the type of error log details I can see in the console:

127.0.0.1:55825: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:55967: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:64891: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:61466: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:51332: request
  -> HTTP protocol error in client request: Server disconnected
127.0.0.1:52783: request
  -> HTTP protocol error in client request: Server disconnected

How can I force the script not to print those log details?

SMTH
  • Not sure if I understand correctly, but if you only want to get rid of the log lines, why not avoid logging those cases with a simple if statement? Otherwise, if you want to catch these errors, you might want to add another assertion so that post_title does not contain the error message: assert 'Server disconnected' not in post_title – pafede2 Jun 14 '21 at 12:26
  • I don't understand which portion of the above post is unclear to you @pafede2. I kicked out `assert` and used an `if` statement as you suggested. I also added the line containing `Server disconnected` but the error is still there. I just can't figure out where the heck those error logs come from. I never encountered such an error when I used `selenium` instead of `seleniumwire`. – SMTH Jun 14 '21 at 13:48
  • What is the full log line that you see? Not only the details! – rfkortekaas Jun 21 '21 at 06:17
  • Check [this](https://filebin.varnish-software.com/jjnejl9tm81mi73g/stacktrace.txt) out @rfkortekaas. The script gets disconnected (which you can see in the log) after one successful loop as I used a single proxy. However, when the proxies are working ones, this stuff always `-> HTTP protocol error in client request: Server disconnected` keeps coming up. – SMTH Jun 21 '21 at 06:50
  • Can you try it with adding `chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])` as `chrome_options=chrome_options` to `webdriver.Chrome()` where `chrome_options` is `Options()` from `selenium.webdriver.chrome.options` – rfkortekaas Jun 21 '21 at 08:12
  • When I tried like this `options.add_experimental_option('excludeSwitches', ['enable-logging'])`, I got this error `AttributeError: 'dict' object has no attribute 'add_experimental_option'` as I've defined proxies in a dict within the same `options` just before that line. However, I tried like this `options["excludeSwitches"] = ['enable-logging']` but I see that error `-> HTTP protocol error in client request: Server disconnected` still coming up. – SMTH Jun 21 '21 at 08:49
  • Which versions do you use (chrome, chromedriver, selenium and seleniumwire)? If I do the following (replace ; for newline) it's not giving any logging on stdout: `from selenium.webdriver.chrome.options import Options; chrome_options = Options() ; chrome_options.add_experimental_option('excludeSwitches', ['enable-logging']); chrome_options.add_experimental_option('excludeSwitches', ['enable-logging'])` – rfkortekaas Jun 21 '21 at 09:55
  • Sorry, that didn't work either. My Python version is `3.7` and my selenium version is `3.141.0`. According to this [documentation](https://pypi.org/project/selenium-wire/), it seems that I need to have `Selenium 3.4.0+`. Can you suggest how I can upgrade the selenium version to `3.4+`? – SMTH Jun 21 '21 at 10:36

3 Answers

3

ORIGINAL POST 06-22-2021 @12:00 UTC

The error:

HTTP protocol error in client request: Server disconnected

Is being thrown by mitmproxy. I pulled the following from mitmproxy source code.

"clientconnect": "client_connected",
"clientdisconnect": "client_disconnected",
"serverconnect": "server_connect and server_connected",
"serverdisconnect": "server_disconnected",


class ServerDisconnectedHook(commands.StartHook):
    """
    A server connection has been closed (either by us or the server).
    """
    blocking = False
    data: ServerConnectionHookData

I would recommend putting your code in a Try Except block, which will allow you to suppress the errors thrown by mitmproxy.

from mitmproxy.exceptions import MitmproxyException
from mitmproxy.exceptions import HttpReadDisconnect

try:
    ...  # your driver code goes here
except HttpReadDisconnect:
    pass
except MitmproxyException:
    # Base class for all exceptions thrown by mitmproxy.
    pass
finally:
    driver.quit()

If the exceptions that I provided don't suppress your error, then I would recommend trying some of the other exceptions in mitmproxy.

UPDATE 06-22-2021 @15:28 UTC

In my research I noted that seleniumwire has integration code for mitmproxy. Part of this integration forwards the log messages emitted by mitmproxy to seleniumwire's own logger.

class SendToLogger:

    def log(self, entry):
        """Send a mitmproxy log message through our own logger."""
        getattr(logger, entry.level.replace('warn', 'warning'), logger.info)(entry.msg)
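
Since these messages pass through a standard Python logger, one way to quiet them might be to raise that logger's threshold. The sketch below is an assumption, not something confirmed in this thread: it supposes that seleniumwire's loggers are named after its modules and therefore sit under a parent logger called "seleniumwire". The helper name quiet_logger is hypothetical.

```python
import logging


def quiet_logger(name="seleniumwire", level=logging.ERROR):
    """Raise the threshold of a named logger and return it.

    The default name "seleniumwire" is an assumption about how the
    library names its loggers; adjust it if your version differs.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    return logger


# Child loggers (e.g. "seleniumwire.server") that have no level of their own
# inherit this threshold, so warnings logged through them are dropped.
quiet_logger()
```

Unlike logging.disable(), which affects every logger in the process, this only touches one branch of the logger hierarchy, so your own warnings still appear.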

In my testing, suppressing the error in question using mitmproxy.exceptions proved difficult. Of the following exceptions, the only one that fired was HttpReadDisconnect, and it did not fire consistently.

  • HttpException
  • HttpReadDisconnect
  • HttpProtocolException
  • Http2ProtocolException
  • MitmproxyException
  • ServerException
  • TlsException

I noted that if I added a standard Exception:

except Exception as error:
    print('standard')
    print(''.join(traceback.format_tb(error.__traceback__)))

That this line in your code consistently throws errors:

 File "/Users/user_name/Python_Projects/scratch_pad/seleniumwire_test.py", line 18, in <module>
    assert driver.requests[0].response.status_code == 200

When I looked at this error in more detail I found that it was related to the status_code.

<class 'AttributeError'>
'NoneType' object has no attribute 'status_code'
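
The AttributeError above arises because a captured request may not have a response yet (or ever), in which case its response attribute is None. A minimal sketch of a guarded check follows. The helper name first_status_code is hypothetical, and the SimpleNamespace objects are stand-ins for the items in driver.requests so that the example runs without a browser.

```python
from types import SimpleNamespace


def first_status_code(requests):
    """Return the status code of the first request that has a response, else None."""
    for request in requests:
        response = getattr(request, "response", None)
        if response is not None:
            return response.status_code
    return None


# Stand-ins for driver.requests: the first request never received a response.
captured = [
    SimpleNamespace(response=None),
    SimpleNamespace(response=SimpleNamespace(status_code=200)),
]
print(first_status_code(captured))
```

Whether the first request with a response is the one you care about depends on the page, so in real use you may want to match on the request URL as well.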

UPDATE 06-23-2021 @15:04 UTC

During my research I found that selenium had a service_log_path parameter that could be added to webdriver.Chrome().

class WebDriver(ChromiumDriver):
   
    def __init__(self, executable_path="chromedriver", port=DEFAULT_PORT,
                 options: Options = None, service_args=None,
                 desired_capabilities=None, service_log_path=DEFAULT_SERVICE_LOG_PATH,
                 chrome_options=None, service: Service = None, keep_alive=DEFAULT_KEEP_ALIVE):

According to the documentation this parameter could be used this way: service_log_path=/dev/null

Unfortunately, the comments in class WebDriver(ChromiumDriver) indicated that this parameter is deprecated. It also failed to suppress the sys.stdout error messages.

service_log_path - Deprecated: Where to log information from the driver.

CURRENT STATUS

I reworked your code and removed the status_code lines that were throwing errors. I added implicitly_wait() and WebDriverWait statements to handle what you were trying to do with the status_code check. I also added error handling to catch specific exception types, and some chrome_options to suppress certain things, such as loading website images, which are not needed for scraping the target website.

Finally, I added a custom logging context manager to suppress the error messages that were being sent to sys.stdout. I tested the code many times and so far I haven't received an error message on sys.stdout. More testing might be needed if you get the messages again.

Here is a link of the code in action.

import sys
import logging
import traceback
from seleniumwire import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from mitmproxy.exceptions import HttpReadDisconnect, TcpDisconnect, TlsException


class DisableLogger():
    def __enter__(self):
       logging.disable(logging.WARNING)
    def __exit__(self, exit_type, exit_value, exit_traceback):
       logging.disable(logging.NOTSET)


options = {
    "backend": "mitmproxy",
    'mitm_http2': False,
    'disable_capture': True,
    'verify_ssl': True,
    'connection_keep_alive': False,
    'max_threads': 3,
    'connection_timeout': None,
    'proxy': {
        'https': 'https://209.40.237.43:8080',
    }
}

chrome_options = Options()
chrome_options.add_argument(
    "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36")
chrome_options.add_argument("--start-maximized")
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-logging')
chrome_options.add_argument("--disable-application-cache")
chrome_options.add_argument("--ignore-certificate-errors")
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

webdriver.DesiredCapabilities.CHROME['acceptSslCerts'] = True

prefs = {
   "profile.managed_default_content_settings.images": 2,
   "profile.default_content_settings.images": 2
 }

capabilities = webdriver.DesiredCapabilities.CHROME
chrome_options.add_experimental_option("prefs", prefs)
capabilities.update(chrome_options.to_capabilities())

driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver',
                          options=chrome_options, seleniumwire_options=options)

with DisableLogger():
    driver.implicitly_wait(60)
    try:
        driver.get('https://www.zillow.com/Houston,-TX/houses/')
        wait = WebDriverWait(driver, 240)
        page_title = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="search-page-react-content"]')))
        if page_title:
            post_links = [i.get_attribute("href") for i in driver.find_elements_by_css_selector("article[role='presentation'] > .list-card-info > a.list-card-link")]
            for individual_link in post_links:
                driver.implicitly_wait(60)
                driver.get(individual_link)
                post_title = driver.find_element_by_css_selector("h1").text
                print(post_title)

    except HttpReadDisconnect as error:
        print('A HttpReadDisconnect Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except TimeoutException as error:
        print('A TimeOut Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except TcpDisconnect as error:
        print('A TCP Disconnect Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except TlsException as error:
        print('A TLS Exception has occurred')
        exc_type, exc_value, exc_tb = sys.exc_info()
        print(exc_type)
        print(exc_value)
        print(''.join(traceback.format_tb(error.__traceback__)))
        driver.quit()

    except Exception as error:
        print('An exception has occurred')
        print(''.join(traceback.format_tb(error.__traceback__)))
        pass

    finally:
        driver.quit()

OBSERVATIONS

I noted that you are using free proxies instead of a paid proxy service. I found that the proxy in your code, hxxps://136.226.33.115:80, was a standard HTTP proxy with latency issues, which caused timeouts when connecting to your target website.

Another observation is that your target website uses a captcha, which is triggered when you send too many connection requests.

I also noted that your proxy server had connection issues of its own, which would cause error messages to be sent to sys.stdout. This is likely what you were encountering.

SIDE NOTE

The selenium session in your code occasionally encounters an "I am human" captcha from Zillow.

----------------------------------------
My system information
----------------------------------------

Platform: Mac 
Python Version: 3.9
Seleniumwire: 4.3.1
Selenium: 3.141.0
mitmproxy: 6.0.2
browserVersion: 91.0.4472.114
chromedriverVersion: 90.0.4430.24
IDE: PyCharm 2021.1.2

----------------------------------------
Life is complex
  • Honestly speaking, I chose `seleniumwire` only to be able to use this line `driver.requests[0].response.status_code == 200` to get the status. Pure Selenium doesn't offer anything of that sort to check the status code. – SMTH Jun 22 '21 at 17:09
  • You can check the status code with selenium. If need I can provide that code to you. – Life is complex Jun 22 '21 at 17:31
  • Yeah, sure. Seleniumwire will be of no use to me then. – SMTH Jun 22 '21 at 17:40
  • @SMTH Please let me know if you have any questions about the code that I posted in the *Current Status* section of my answer. – Life is complex Jun 23 '21 at 15:57
  • [This](https://filebin.varnish-software.com/hy2m4ii4khejdexs) is what I experience when I execute your script. I ran the script twice to be sure that what I saw is real. To be specific, the error is still there and I could see numerous windows opened at the same time. Thanks. – SMTH Jun 23 '21 at 17:04
  • I have never seen those errors in my testing, so they are related to your environment. Take a look at the system information that I posted. Also, I always use an *executable_path* for my *chromedriver*. – Life is complex Jun 23 '21 at 17:38
  • I posted a video link in my answer showing my code in action. It throws no error messages. – Life is complex Jun 23 '21 at 18:21
  • Looks like it is working perfectly fine on your end. I didn't include the chromedriver path within the script only because the path is already added to the environment. – SMTH Jun 23 '21 at 18:38
  • It is working perfectly on my system. Can you please share your system information in your question? – Life is complex Jun 23 '21 at 18:51
  • I'm on Win 7, 32 bit. My python version 3.7 and selenium version is 3.141.0. – SMTH Jun 23 '21 at 20:52
  • What is your browserVersion and chromedriverVersion? – Life is complex Jun 23 '21 at 20:56
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234132/discussion-between-life-is-complex-and-smth). – Life is complex Jun 23 '21 at 21:19
2

You can set the desired log level with the following line:

options.add_argument('--log-level=3')

Add this to your options.

log-level attribute sets the minimum log level.
Valid values are from 0 to 3:

INFO = 0,
WARNING = 1,
LOG_ERROR = 2,
LOG_FATAL = 3.

default is 0.
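
As a small illustration of the mapping above, the sketch below builds the switch from a severity name. The helper log_level_argument and the CHROME_LOG_LEVELS dictionary are hypothetical names introduced only for this example. The resulting string is what would be passed to add_argument on a Chrome Options object.

```python
# Severity names and their numeric values, as listed above.
CHROME_LOG_LEVELS = {"INFO": 0, "WARNING": 1, "LOG_ERROR": 2, "LOG_FATAL": 3}


def log_level_argument(name):
    """Build the --log-level switch for a severity name (hypothetical helper)."""
    return f"--log-level={CHROME_LOG_LEVELS[name]}"


print(log_level_argument("LOG_FATAL"))  # --log-level=3
```

Note that this switch controls the browser's own logging, so it may not affect messages produced by the proxy layer.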

Prophet
  • I tried your suggestion but that didn't work out. In fact that was my first attempt to suppress those errors before posing the question. I tried all your provided options by the way. – SMTH Jun 14 '21 at 13:39
  • maybe it's just a different syntax? I see you used different syntax for what you defined in `options` in your code. So maybe you should just match syntaxes? Currently I see no additional ways to do that. – Prophet Jun 14 '21 at 13:47
  • Syntax is not the issue here. I used them correctly. FYI, this is the right syntax `c_options.add_argument('--log-level=3')` and finally `webdriver.Chrome(options=c_options,seleniumwire_options=options)` – SMTH Jun 14 '21 at 13:51
  • I know what I wrote is with correct syntax, but you used some other syntax in your code `options = { 'mitm_http2': False, 'proxy': {'https': f'https://136.226.33.115:80'} }` – Prophet Jun 14 '21 at 14:12
  • You forgot to go through my last comment. This is how you can use two type of options in there `webdriver.Chrome(options=c_options,seleniumwire_options=options)` – SMTH Jun 14 '21 at 14:24
  • Ah, sorry. Actually I'm using Java, so I'm not really familiar with these details in Python :) – Prophet Jun 14 '21 at 14:25
1

If you are working on a Linux distribution, you can redirect error output. To do so, add 2>/dev/null to the end of your command. For example, you can run your script like this:

python SCRIPT 2>/dev/null
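
For a platform-independent variant, a similar effect can be sketched inside the script itself with the standard library. This assumes, as this answer does, that the unwanted messages are written to stderr. It only affects writes made through Python's sys.stderr in the same process, not output written directly by child processes such as chromedriver. The names run_with_stderr_discarded and noisy are hypothetical.

```python
import contextlib
import os
import sys


def run_with_stderr_discarded(func, *args, **kwargs):
    """Call func while Python-level writes to sys.stderr go to the null device."""
    with open(os.devnull, "w") as devnull:
        with contextlib.redirect_stderr(devnull):
            return func(*args, **kwargs)


def noisy():
    # Stand-in for the scraping code.
    print("this line is discarded", file=sys.stderr)
    return "done"


result = run_with_stderr_discarded(noisy)
print(result)
```

Be aware that this also hides genuine error output for the duration of the call, so it is best wrapped around as little code as possible.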
Ahmad