Error while parsing google search result using urllib in python

Question

I started learning web scraping in Python using urllib and bs4.

I was looking for some code to analyze and found this answer: https://stackoverflow.com/a/38620894/14252018. Here is the code:

from urllib.parse import urlencode, urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
    print(url[0])

When I try to run this, it does not print anything.

So then I tried bs4, this time against https://www.duckduckgo.com,

and changed the code to this:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())

I got an error:

Why didn't the first block of code print anything?
Why did the second block of code give me an error, and what does that error mean?

Perhaps try `cssselect(".r.a")` if you're searching for elements with class="r a" or class="a r" — , Sep 11 '20 at 14:22
And why did the second block of code give an error, and what does that mean? — Praveen, Sep 11 '20 at 14:30
Why do you assume that the duckduckgo message was an error? The message just shows that duckduckgo detected that javascript is not understood and that duckduckgo is redirecting you to a different page. — , Sep 11 '20 at 14:33
What else did you expect the 2nd block of code to print out? — , Sep 11 '20 at 14:36
"The whole html code other than the tags" but that only appears if you follow the redirect, which you did not do. Both `urllib.request.urlopen` and `requests.get` follow HTTP redirects by default, but neither acts on a redirect that the page itself requests through its HTML or JavaScript. — , Sep 11 '20 at 14:40

score 0 · Accepted Answer · answered Sep 11 '20 at 15:29

Change your duckduckgo URL to where the site tries to redirect you when javascript is not enabled.

import bs4 as bs
import urllib.request

# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())
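
To find out where a page wants to send a client without JavaScript, one option is to look for a meta refresh tag in the HTML that came back. The following is a minimal sketch using only the standard library. `MetaRefreshFinder`, `find_meta_refresh`, and the sample HTML are hypothetical; the sample is an assumption about what such a page might look like, not DuckDuckGo's actual markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaRefreshFinder(HTMLParser):
    """Collect the target of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=/html?q=dinosaur".
        content = attr.get("content") or ""
        _, _, rest = content.partition(";")
        key, _, value = rest.strip().partition("=")
        if key.strip().lower() == "url" and value:
            self.targets.append(value.strip().strip("'\""))


def find_meta_refresh(html, base_url):
    """Return absolute URLs of all meta refresh targets found in html."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return [urljoin(base_url, target) for target in finder.targets]


# Made-up page, for illustration only.
SAMPLE = (
    '<html><head><noscript>'
    '<meta http-equiv="refresh" content="0; url=/html?q=dinosaur">'
    '</noscript></head><body></body></html>'
)
print(find_meta_refresh(SAMPLE, "https://duckduckgo.com/"))
```

If the list is non-empty, fetching the first entry with `urllib.request.urlopen` is the manual equivalent of what a browser without JavaScript would do.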

The first block prints nothing because nothing matched your CSS selector. Google serves different pages depending on whether JavaScript is enabled. Neither urllib nor requests runs JavaScript. – Sep 11 '20 at 15:37
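
When a selector matches nothing, the body of the `for` loop never runs, so the script prints nothing and raises no error. Here is a minimal sketch of more defensive link handling, using only the standard library. `unwrap`, `collect`, and the sample hrefs are hypothetical names and data, not part of the linked answer. It also avoids printing `url[0]` on a plain string, which would give only its first character.

```python
from urllib.parse import parse_qs, urlparse


def unwrap(href):
    """Return the target of a '/url?q=...' link, or the href unchanged."""
    if href.startswith("/url?"):
        targets = parse_qs(urlparse(href).query).get("q", [])
        return targets[0] if targets else None
    return href


def collect(hrefs):
    """Unwrap every href, and say so explicitly when there is nothing to unwrap."""
    if not hrefs:
        print("selector matched nothing; inspect the HTML that was actually returned")
        return []
    return [url for url in map(unwrap, hrefs) if url]


# Made-up hrefs standing in for what page.cssselect(".r a") might yield.
sample = ["/url?q=https://stackoverflow.com/&sa=U", "https://example.com/"]
print(collect(sample))
print(collect([]))
```

Printing `raw[:500]` before parsing is a quick way to check whether the server returned the markup the selector expects.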
