
so I have some software that uses web scraping, but for some reason it doesn't work. It's bizarre: when I run the code in Google Colab it works fine and the URLs can be opened and scraped, but when I run it in my web application (launched from my console with python3 run.py) it fails.

Here is the code that is returning errors:

    b = searchgoogle(query, num)
    c = []
    print(b)
    extractor = extractors.ArticleExtractor()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/50.0.2661.102 Safari/537.36'
    }
    for i in b:
        req = Request(url=i, headers=headers)
        d = urlopen(req)
        try:
            if d.info()['content-type'].startswith('text/html'):
                print('its html')
                resp = requests.get(i, headers=headers)
                if resp.ok:
                    doc = extractor.get_content(resp.text)
                    c.append(comparetexts(text, doc, i))
                else:
                    print(f'Failed to get URL: {resp.status_code}')
            else:
                print('its not html')
        except KeyError:
            print('its not html')
        print(i)
    return c

The line raising the error is `d = urlopen(req)`.

There is more code above the section I pasted here, but it has nothing to do with the error. Anyway, thanks for your time!

(By the way, I checked my OpenSSL version in python3 and it reports 'OpenSSL 1.1.1m 14 Dec 2021', so I think it's up to date.)

1 Answer


This happens because the Python environment running your web application cannot verify the server's SSL certificate. You can tell your script to skip SSL verification when making the request, as described here: Python 3 urllib ignore SSL certificate verification
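A minimal sketch of that approach with `urllib`, assuming you wrap the call in a helper (the helper name `open_unverified` is mine, not from the question). Note that disabling verification removes a security check and should only be used while debugging:

```python
import ssl
from urllib.request import Request, urlopen

def open_unverified(url, headers):
    """Open a URL with SSL certificate verification disabled.
    WARNING: this skips a security check; use only while debugging."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # don't match the hostname
    ctx.verify_mode = ssl.CERT_NONE  # don't verify the certificate chain
    req = Request(url=url, headers=headers)
    return urlopen(req, context=ctx)
```

In your loop you would then replace `d = urlopen(req)` with `d = open_unverified(i, headers)`.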

Adid
  • This didn't work. I replaced `req = Request(url=i, headers=headers); d = urlopen(req)` with `req = requests.get(url=i, headers=headers, verify=False); d = urlopen(req)`, but now I'm getting the error: `ssl.SSLError: [SSL: DH_KEY_TOO_SMALL] dh key too small (_ssl.c:997)` – SalvorHardin Feb 06 '22 at 06:23
  • Have you tried not using urlopen? There are ways to find the content type without it: `req.headers['content-type']` – Adid Feb 06 '22 at 07:43
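Following that last suggestion, a sketch that drops `urlopen` entirely and checks the `Content-Type` response header through `requests` alone (the helper name `fetch_if_html` is hypothetical; `verify=False` disables SSL verification and is for debugging only):

```python
import requests

def fetch_if_html(url, headers):
    """Fetch a page with requests only; return its text if it is HTML.
    Reads the Content-Type response header instead of urlopen's d.info().
    verify=False skips SSL certificate verification -- debugging only."""
    resp = requests.get(url, headers=headers, verify=False)
    if resp.ok and resp.headers.get('content-type', '').startswith('text/html'):
        return resp.text
    return None
```

As for the `DH_KEY_TOO_SMALL` error: it typically means the server offers a weak Diffie-Hellman key that newer OpenSSL builds reject. A common (security-weakening) workaround is to lower OpenSSL's security level on a custom `ssl.SSLContext` via `ctx.set_ciphers('DEFAULT@SECLEVEL=1')`.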