
I am trying to write a script in Python 3.7 that opens, in a web browser, the websites fed to it on the command line. Here is what I have so far:

Example.py

import sys

from mechanize import Browser
browser = Browser()

browser.set_handle_equiv(True)
browser.set_handle_gzip(True)
browser.set_handle_redirect(True)
browser.set_handle_referer(True)
browser.set_handle_robots(False)

# pretend you are a real browser
browser.addheaders = [('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')]

listOfSites = sys.argv[1:]
for i in listOfSites:
    browser.open(i)

I have entered the following command in cmd:

python Example.py https://www.google.com

And I get the following traceback:

Traceback (most recent call last):
  File "Example.py", line 19, in <module>
    browser.open(i)
  File "C:\Python37\lib\site-packages\mechanize\_mechanize.py", line 253, in open
    return self._mech_open(url_or_request, data, timeout=timeout)
  File "C:\Python37\lib\site-packages\mechanize\_mechanize.py", line 283, in _mech_open
    response = UserAgentBase.open(self, request, data)
  File "C:\Python37\lib\site-packages\mechanize\_opener.py", line 188, in open
    req = meth(req)
  File "C:\Python37\lib\site-packages\mechanize\_urllib2_fork.py", line 1104, in do_request_
    for name, value in self.parent.addheaders:
ValueError: too many values to unpack (expected 2)

I am very new to Python; this is my first script. I am stuck on the traceback above and haven't found a solution yet. I have searched through a lot of questions on SO as well, but they didn't seem to help. What should I do next?

UPDATE:

As suggested by @Jean-François Fabre in his answer, I have added 'User-agent' to the header. There is no traceback now, but the link still does not open in the browser.

Here is what addheaders looks like now:

browser.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')]

5 Answers


I have just found a workaround for this issue, even though the issue above still exists. I am posting it only to let readers know that it can be done this way too:

Instead of using the mechanize package, we can use the webbrowser package (part of the standard library) and write the following Python code in Example.py:

import webbrowser
import sys

#This is an upgrade suggested by @Jean-François Fabre
listOfSites = sys.argv[1:]

for i in listOfSites:
    webbrowser.open_new_tab(i)

Then we can run this python code by executing the following command in the terminal/command prompt:

python Example.py https://www.google.com https://www.bing.com

The command above will open two sites at once, Google and Bing, each in a new tab.
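As a side note, the webbrowser module can also target a specific browser instead of the system default. This is only a sketch, and the name "firefox" is an assumption; it must match a browser that the module can find on your machine:

```python
import webbrowser

# "firefox" is an assumption: the name must match a browser the webbrowser
# module can locate on this machine
try:
    controller = webbrowser.get("firefox")
    controller.open_new_tab("https://www.google.com")
except webbrowser.Error:
    # no browser registered under that name; fall back to the system default
    webbrowser.open_new_tab("https://www.google.com")
```

If the named browser isn't available, webbrowser.get() raises webbrowser.Error, so the fallback keeps the script working either way.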

  • This is the _correct_ solution as `mechanize` does not open a graphical user interface (AFAIK). – tafaust Feb 11 '19 at 10:43
  • @ThomasHesse are you sure? Because when I was trying to do the above implementation, one of the methods out of a bunch of others was to use `Mechanize`. And the code worked for some time, but the next day when I tried to run the code, this issue happened; not sure what happened in a single night. – Code_Ninja Feb 11 '19 at 11:36

I don't know mechanize at all, but the traceback and variable names (and some googling) can help.

You're initializing addheaders with a list of strings. Other examples (e.g. Mechanize Python and addheader method - how do I know the newest headers?) show a list of tuples instead, which matches the traceback. For example:

browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

so that each entry unpacks properly into name and value in the loop:

for name, value in whatever.addheaders:

You have to add the 'User-agent' property name (you can also pass other, less common header fields besides the browser name).
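To see the difference, here is a minimal, standalone sketch of the unpacking behaviour; the header values are just placeholders:

```python
# Each entry must be a (name, value) pair for the two-target loop to work:
headers = [('User-agent', 'Mozilla/5.0'), ('Referer', 'https://example.com')]
for name, value in headers:
    print(name, '->', value)

# A bare string is itself iterable (character by character), so unpacking it
# into two names raises the exact ValueError from the traceback:
bad_headers = ['Mozilla/5.0 (Windows NT 10.0; Win64; x64)']
try:
    for name, value in bad_headers:
        pass
except ValueError as err:
    print(err)  # too many values to unpack (expected 2)
```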


Let me try to answer your question in parts:

  • You're correct in adding "browser headers". Many servers might outright drop your connection otherwise, as a missing User-Agent is a definite sign of being crawled by a bot.

  • mechanize, as stated by the docs, "is a tool for programmatic web browsing".
    This means it is primarily used to crawl webpages, parse their contents, fill in forms, click on things and send requests, but not through a "real" web browser with features such as CSS rendering. For example, you cannot open a page and take a screenshot, as nothing is actually "rendered"; to achieve that you would need to save the page and render it with another solution.

  • If this suits your needs, look into headless browsers as a technology; there are a lot of them. In the Python ecosystem, other than mechanize, I'd check out headless Chromium, as PhantomJS is unfortunately discontinued.

But if I understand correctly, you need the actual web browser to open up with the webpage, right? For this reason, you actually need, well, a browser in your system to take care of that!

Case 1 : Use your native system's browser

Find out where your browser's executable lies on your system and add it to your PATH. For example, my Firefox executable is at "C:\Program Files\Mozilla Firefox\firefox.exe".

As you're using Windows, use the Start menu to navigate to Advanced System Settings --> Advanced --> Environment Variables, and add the path above to your PATH variable.

If you're using Linux, export PATH=$PATH:"/path/to/your/browser" will take care of things.
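As a quick sanity check, you can ask Python whether the PATH step worked. This is a minimal sketch using only the standard library; "firefox" is an assumption and should be whichever executable you added:

```python
import shutil

# shutil.which returns the full path of the executable,
# or None if the browser is not on PATH yet
browser_path = shutil.which("firefox")
if browser_path is None:
    print("firefox is not on PATH; check the steps above")
else:
    print("found firefox at", browser_path)
```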

Then, your code can run as simply as

import subprocess
import sys

listOfSites = sys.argv[1:]
# each link needs its own "-new-tab" flag, and each flag/URL must be
# a separate argument (a single space-joined string won't work)
args = ["firefox"]
for i in listOfSites:
    args += ["-new-tab", i]
print(args)
subprocess.run(args)

Firefox will open each of the links you have provided in a new tab.

Case 2 : Use selenium

Then comes Selenium, which in my opinion is the most mature solution to browser-related problems, and what most people use. I've used it in a production setting with very good results. It provides both the UI/frontend of a browser that renders the webpages, but also allows you to programmatically work with these webpages.

It needs some setup: for example, if you're using Firefox, you'll need to download the geckodriver executable from their releases page and add it to your PATH variable again.

Then you define your webdriver, spawn one for each of the websites you need to visit, and get the webpage. You can also take a screenshot as proof that the page has been rendered correctly.

from selenium import webdriver
import sys

listOfSites = sys.argv[1:]
for i in listOfSites:
    driver = webdriver.Firefox()
    # only prepend a scheme if the argument doesn't already have one
    url = i if i.startswith('http') else 'http://' + i
    driver.get(url)
    # replace characters that are not valid in file names
    safe_name = url.replace('://', '_').replace('/', '_')
    driver.save_screenshot(safe_name + '-screenshot.png')

# When you're finished
# driver.quit()

I've tested both of these code snippets, and they work as expected. Please let me know how all of this sounds, and whether you need any additional information! ^^

  • I am just trying to create a script to surf the net; it's not a dependency to use an actual web browser, and Selenium is something that is mostly used by automation testers. I primarily need to focus on Python, as this is something I need to learn as well as possible for AI development. Can you please elaborate more on how I can avoid an actual browser but still surf the net? – Code_Ninja Feb 11 '19 at 13:16
  • What does "surf the net" consist of? If you want to render and display the webpages, you *need* a browser, and I don't think you can replace its functionality, (or at least I don't know of a way).. If you want to just crawl from URL to URL and obtain the HTML only, `mechanize` or `webbrowser` are good to go, but I'd recommend also looking at the `requests` package, but you *won't* have a graphical user interface. Selenium is indeed primarily used for automation testing, but it's quite simple to grasp, and I think has its uses for other tasks in the industry. – hyperTrashPanda Feb 11 '19 at 13:25
  • It's okay, I don't need a graphical interface. All I need is to search, get the search results, and process form data: what data should be passed and what value I need from that query. That's what I am hoping for in _surf the net_ – Code_Ninja Feb 11 '19 at 13:27
  • How can I use the `mechanize / webbrowser / requests` packages for this, since I am very new to Python? Is there any link or API reference I can refer to for these packages? – Code_Ninja Feb 11 '19 at 13:29
  • Ah, okay, now I understand your requirements in a better way, thanks for the clarification! For your specific use case, the best thing IMO, is indeed `mechanize`. Keep the [official docs](https://mechanize.readthedocs.io/en/latest/browser_api.html) nearby, as well as [these](https://www.pythonforbeginners.com/mechanize/browsing-in-python-with-mechanize) [and](http://toddhayton.com/2014/12/08/form-handling-with-mechanize-and-beautifulsoup/) [these](http://www.open-bigdata.com/data-mining-web-python-mechanize-tutorial/) tutorials, and start hacking away! StackOverflow is also here for questions – hyperTrashPanda Feb 11 '19 at 13:37
  • Final thing, if you *need* to load and execute javascript in your webpages, Selenium is a must, but for now, I think you're free to ignore it. – hyperTrashPanda Feb 11 '19 at 13:37
  • I don't need to load Javascript into the webpage; I am just trying to create a bot that can surf the net. Let's see what else I can explore while making this. I am seriously excited to learn about AI and how I can make one using Python. – Code_Ninja Feb 11 '19 at 13:39
  • Have fun in doing so! Sentiment Analysis in webpages is a good idea that can be explored, and could be straightforward if you have the parsed data. [Alexa's Top 500](https://www.alexa.com/topsites) list is a nice starting point. – hyperTrashPanda Feb 11 '19 at 13:45
  • Is there any way to debug a Python script in Notepad++, or can you suggest some other Python editor that can help me debug the code? – Code_Ninja Feb 13 '19 at 09:46
  • Notepad++ (and most text editors) don't provide debugging capabilities AFAIK, so you'd have to either use `pdb`, `gdb` to debug from the command line, or a proper Python IDE such as Pycharm, Eclipse, Spyder, you have lots to choose from. – hyperTrashPanda Feb 13 '19 at 10:36

Here you go :)

import sys
from mechanize import Browser, Request


browser = Browser()

browser.set_handle_equiv(True)
browser.set_handle_gzip(True)
browser.set_handle_redirect(True)
browser.set_handle_referer(True)
browser.set_handle_robots(False)

# set up your header; add anything you want
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1', 'Referer': 'http://whateveritis.com'}


url_list = sys.argv[1:]
for url in url_list:
    request = Request(url=url, data=None, headers=header)
    response = browser.open(request)
    print(response.read())
    response.close()

Same as above. Guess I didn't read all the answers before digging in. LOL

import sys
import webbrowser

# webbrowser alone is enough here; the mechanize setup from the question
# is never used when opening links this way
listOfSites = sys.argv[1:]
for i in listOfSites:
    webbrowser.open(i)