
[Beginner]

This is a more theoretical question, and the attempts I made were in Python, the language I am most familiar with. I was given a task in which I receive many URLs (>200) and must scan each and every one of them to look for a document (which has the same name on all of them) and download it. It's a lot of steps, I know.

The thing is: it wouldn't make sense for me to manually enter each site, map its structure, build the code, run it, and then get the file (the file part is not the concern now), as most web-scraping tutorials suggest. In this case, it would be faster to enter the sites manually and find and download the document, since it is very specific information and the download will not be a recurring activity.

What I need to know is whether there is a way, even an exhaustive one, to search for this file within each site. On the URLs I visited, I noticed that the internal paths (the sequence of buttons to click) are almost the same, or at least go through a list of mappable names. I don't mind having to click every button and test every possibility, as long as it's possible.

This question is an attempt to get reading material, examples, or any hints that would let me know which way to go if that were possible.

Example 1 (recalling that the structure here is not valid for all cases):
Start URL: https://ceagesp.gov.br/
Final URL: https://ceagesp.gov.br/acesso-a-informacao/governanca/carta-anual-de-politicas-publicas-e-governanca-corporativa/
Path to where the files are concentrated on that site: Home > Acesso à Informação > Governança > Carta Anual de Políticas Públicas e Governança Corporativa > (select the desired year)

Example 2:
Start URL: https://www.amazul.mar.mil.br/
Final URL: https://www.amazul.mar.mil.br/transparencia-governanca-documentos-carta-consad
Path to where the files are concentrated on that site: Início > Carta Anual de Políticas Públicas e Governança Corporativa 2019 (Carta do Consad)

Any help would be very helpful.

  • That's more like a spider's work...you can try "Carta Anual de Políticas Públicas e Governança Corporativ site:xxxx.br" in google and start from there. – lex Mar 02 '23 at 05:17

2 Answers


I think it would be good practice to use the BS4 (BeautifulSoup) library; take a good read through how to use it. Assuming you already know how to download a file once you have its link, you need to look for `<a>` tags (which are used for links).

However, this method would probably also take a long time.

Assuming you need to visit other HTML files on the same website, again look for <a> tags which redirect you to the same host.
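As a rough sketch of this idea (the text hint, page limit, and starting URL are assumptions for illustration, not part of the question), a small same-host crawler with `requests` and BeautifulSoup might look like:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return (link text, absolute URL) pairs for <a> tags that stay on
    the same host as base_url."""
    host = urlparse(base_url).netloc
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        if urlparse(url).netloc == host:
            links.append((a.get_text(strip=True), url))
    return links


def find_document(start_url, text_hint, max_pages=50):
    """Breadth-first crawl of one site, following only same-host links,
    until an <a> tag whose text contains text_hint is found."""
    queue, seen = [start_url], set()
    while queue and len(seen) < max_pages:
        page = queue.pop(0)
        if page in seen:
            continue
        seen.add(page)
        try:
            resp = requests.get(page, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        for text, url in extract_links(resp.text, page):
            if text_hint.lower() in text.lower():
                return url  # candidate page or file link
            if url not in seen:
                queue.append(url)
    return None
```

You could then call, say, `find_document("https://ceagesp.gov.br/", "Carta Anual")` for each of the 200+ sites; the `max_pages` cap keeps the exhaustive search bounded per site.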

If I understand your question correctly, my method above would also consume some time, but should work.

Maybe someone else knows a faster method for achieving this.

aronb

As per the two examples in the question:

  • Example 1: Home > Acesso à Informação > Governança > Carta Anual de Políticas Públicas e Governança Corporativa > (select the desired year)
  • Example 2: Início > Carta Anual de Políticas Públicas e Governança Corporativa 2019 (Carta do Consad)

To reach the endpoint with the download link you have to navigate through half a dozen steps, clicking on different elements, e.g. `<a>`, `<input>`, `<span>`, etc., which are spread over multiple pages across two entirely different websites.

Hence, the locator strategies would also be different. So, to conclude, it won't be possible.

On the other hand, if an API endpoint is available, your task can be achieved a lot more easily using BeautifulSoup or Python Requests.
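For example (the URL below is a placeholder; the real path to each document still has to be discovered per site), a minimal `requests`-based download helper could be:

```python
import os
from urllib.parse import urlparse

import requests


def download(url, out_dir="."):
    """Fetch a file over HTTP and save it under its URL's basename."""
    name = os.path.basename(urlparse(url).path) or "download.bin"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail loudly on 4xx/5xx
    path = os.path.join(out_dir, name)
    with open(path, "wb") as f:
        f.write(resp.content)
    return path


# Hypothetical direct file URL:
# download("https://example.gov.br/documents/carta-anual-2019.pdf")
```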


References

You can find a couple of relevant detailed discussions in:

undetected Selenium