
I'm trying to map this website, but I ran into a problem while trying to crawl it fully: I get a 404 error even though the URL exists.

Here is my code:

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

csvFile = open("C:/Users/Pichau/codigo/govbr/brasil/govfederal/govbr/arquivos/teste.txt", 'wt')
paginas = set()

def getLinks(pageUrl):
    global paginas
    html = urlopen("https://www.gov.br/pt-br/" + pageUrl)
    bsObj = BeautifulSoup(html, "html.parser")
    writer = csv.writer(csvFile)
    for link in bsObj.findAll("a"):
        if 'href' in link.attrs:
            if link.attrs['href'] not in paginas:
                # new page found
                newPage = link.attrs['href']
                print(newPage)
                paginas.add(newPage)
                getLinks(newPage)
                csvRow = []
                csvRow.append(newPage)
                writer.writerow(csvRow)


getLinks("")
csvFile.close()

And this is the error message I got, after I tried to run that code:

#wrapper
/
#main-navigation
#nolivesearchGadget
#tile-busca-input
#portal-footer
http://brasil.gov.br
Traceback (most recent call last):
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 26, in <module>
    getLinks("")
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
    getLinks(newPage)
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
    getLinks(newPage)
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 20, in getLinks
    getLinks(newPage)
  [Previous line repeated 4 more times]
  File "c:\Users\Pichau\codigo\govbr\brasil\govfederal\govbr\teste2.py", line 10, in getLinks
    html = urlopen("https://www.gov.br/pt-br/"+pageUrl)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 523, in open
    response = meth(req, response)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 632, in http_response
    response = self.parent.error(
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 561, in error
    return self._call_chain(*args)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "C:\Users\Pichau\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 641, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found
PS C:\Users\Pichau\codigo\govbr>

I've tried it with only the main link, and it works fine, but as soon as I add the pageUrl variable to the URL, I get this error. How can I fix it?

    We can't help you without knowing what `pageUrl` contains. Please, spend some time reading ["How to create a Minimal, Complete, and Verifiable example"](https://stackoverflow.com/help/mcve) and ["How do I ask a good question?"](https://stackoverflow.com/help/how-to-ask). You will get better results by following the tips in those articles. – accdias Apr 29 '21 at 04:12

1 Answer


From what I can see, you're right: the page is there, at least for people using browsers. I assume the site has some basic anti-bot mechanism that rejects uncommon User-Agents, or in other words, only lets browsers view the page. However, since the User-Agent is a header that we control, we can change it so the server no longer returns the 404 error.

I can't type out the code for it at the moment, but you can follow this StackOverflow answer describing how to change a header in urllib and set the "User-Agent" header to a value like Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36, which I've taken from here.

After you've changed the User-Agent header, you should be able to download the page successfully.
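
For reference, here is a minimal sketch of what setting that header could look like with urllib. The helper names `build_request` and `fetch` are made up for illustration, and whether this site actually filters on the User-Agent is still only an assumption.

```python
from urllib.request import Request, urlopen

# User-Agent string quoted in the answer; any current browser string should work.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
)


def build_request(url):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": BROWSER_UA})


def fetch(url):
    """Download the page body as bytes (this performs a real network request)."""
    with urlopen(build_request(url)) as response:
        return response.read()
```

In the question's code, `html = urlopen(...)` would become something like `html = fetch("https://www.gov.br/pt-br/" + pageUrl)`, and the result can be passed to `BeautifulSoup` as before.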

    So, I did change the header now, thanks for the clarification about that, but now I'm getting a different error: urllib.error.URLError: – João Lucas Motta Apr 29 '21 at 15:03
    You typed the URL wrong or the server is down. If my answer helped you, make sure to upvote it and click the checkmark button on the left of it! – Xiddoc Apr 29 '21 at 15:18
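
One way to check the "URL typed wrong" possibility is to look at what the crawler actually requests. The question's code appends every `href` to the base URL, and the last value printed before the traceback was an absolute link. The sketch below compares plain concatenation with `urllib.parse.urljoin`, using hrefs copied from the question's output; the `resolve` helper is hypothetical, and this may or may not be the cause of the error seen.

```python
from urllib.parse import urldefrag, urljoin

BASE_URL = "https://www.gov.br/pt-br/"


def resolve(base, href):
    """Resolve an href against the base URL and drop any #fragment."""
    absolute = urljoin(base, href)
    return urldefrag(absolute).url


# hrefs taken from the output printed in the question
for href in ["#wrapper", "/", "http://brasil.gov.br"]:
    print("concatenated:", BASE_URL + href)
    print("resolved:    ", resolve(BASE_URL, href))
```

With concatenation, the absolute link becomes `https://www.gov.br/pt-br/http://brasil.gov.br`, which is not a page on the site. Resolving first keeps absolute links intact and maps fragment-only links such as `#wrapper` back to the page itself, so they can be skipped as already visited.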