I started learning web scraping in Python using urllib and bs4.

I was looking for code to study and found this answer: https://stackoverflow.com/a/38620894/14252018. Here is the code:

from urllib.parse import urlencode, urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
    print(url[0])

When I run this, it does not print anything.

I saved it as webparse.py.

Then I tried bs4 instead, this time against https://www.duckduckgo.com,

and changed the code to this:

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://duckduckgo.com/?q=dinosaur&t=h_&ia=web').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())

I got an error:

  1. Why didn't the first block of code run?
  2. Why did the second block of code give me an error, and what does that error mean?
Praveen
  • Perhaps try `cssselect(".r.a")` if you're searching for elements with class="r a" or class="a r" –  Sep 11 '20 at 14:22
  • And why did the second block of code give an error, and what does that mean? – Praveen Sep 11 '20 at 14:30
  • Why do you assume that the duckduckgo message was an error? The message just shows that duckduckgo detected that javascript is not understood and that duckduckgo is redirecting you to a different page. –  Sep 11 '20 at 14:33
  • But it did not print anything other than that – Praveen Sep 11 '20 at 14:35
  • What else did you expect the 2nd block of code to print out? –  Sep 11 '20 at 14:36
  • The whole html code other than the tags – Praveen Sep 11 '20 at 14:38
  • "The whole html code other than the tags" but that only appears if you follow the redirect, which you did not do. Both `urllib.request.urlopen` and `requests.get` follow HTTP redirects by default, but neither follows a redirect that is written into the page content itself, which appears to be what duckduckgo does here. –  Sep 11 '20 at 14:40
  • Yeah, done, but how can I find the tags? – Praveen Sep 11 '20 at 14:56
  • The second block of code works on some websites – Praveen Sep 11 '20 at 15:21

1 Answer

Change your duckduckgo URL to the one the site redirects you to when JavaScript is not enabled.

import bs4 as bs
import urllib.request

# url = 'https://duckduckgo.com/?q=dinosaur&t=h_&ia=web' # uses javascript
url = 'https://html.duckduckgo.com/html?q=dinosaur' # no javascript

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

print(soup.get_text())
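
On the follow-up question of how to find the tags rather than just the text: a minimal sketch, using only the standard library's `html.parser` instead of bs4 so it runs without extra installs. The names `LinkCollector` and `extract_links` and the sample markup are made up for illustration; the real page's structure may differ.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, not the real search page.
sample = '<div><a href="https://example.com/dino">Dinosaur</a><a>no href</a></div>'
print(extract_links(sample))
```

To try it on the fetched page, decode the bytes first, for example `extract_links(sauce.decode('utf-8', errors='replace'))`.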


  • Because nothing matched your CSS selector. Google shows different pages depending on whether JavaScript is enabled or not. Neither urllib nor requests runs JavaScript. –  Sep 11 '20 at 15:37
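
Even when a selector does match, the loop in the first block from the question has a second, smaller problem: for an href that does not start with `/url?`, `url` is still a plain string, so `print(url[0])` prints only its first character. A minimal sketch of one way to handle both cases, with `unwrap_redirect` as a hypothetical helper name:

```python
from urllib.parse import parse_qs, urlparse


def unwrap_redirect(href):
    """Return the target of a '/url?q=...' link, or href unchanged."""
    if href.startswith("/url?"):
        # parse_qs maps each query key to a list of values.
        targets = parse_qs(urlparse(href).query).get("q")
        if targets:
            return targets[0]
    return href


print(unwrap_redirect("/url?q=https://stackoverflow.com/&sa=U"))
print(unwrap_redirect("https://stackoverflow.com/"))
```

Both calls print the full URL, because the helper always returns a string and never indexes into it.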