and thanks in advance! I was hoping someone might be able to point me in the right direction on how to scrape a searchable online database. Here is the URL: https://hord.ca/projects/eow/. If possible, I'd like to be able to access all of the data from the site's database; I'm just not sure how to access it using bs4... Maybe bs4 isn't the answer here, though. I'm still a relatively new Pythonista, so any help is greatly appreciated!
-
It might be easiest to contact the developer and ask for a dump of the database: https://hord.ca/projects/eow/about.php – Ben May 24 '17 at 04:07
-
That's kinda what I figured. I was hoping the opposite might be true. I appreciate it! – Sloan Bell May 24 '17 at 04:10
-
Now, that being said, it certainly looks doable to scrape data from the site, if that's the approach you want to take. If you do go this route, make sure you don't send too many requests to the site too quickly, so you don't take it down or make the admin angry. https://stackoverflow.com/a/1825465/2958070 is a link to help get you started downloading the page, and it points towards bs4's site (it's perfect for this, as you suspect). – Ben May 24 '17 at 04:17
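To make the "not too many requests too fast" advice concrete, here is a rough sketch using only the standard library. The names `fetch` and `fetch_politely` and the two-second default delay are illustrative assumptions, not anything the site asks for; pick a delay the admin would be comfortable with.

```python
import time
import urllib.request


def fetch(url, timeout=10):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def fetch_politely(urls, delay_seconds=2.0, fetcher=fetch, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests.

    `fetcher` and `sleep` are parameters (hypothetical names) so the loop
    can be tried out without touching the network.
    """
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        pages[url] = fetcher(url)
    return pages
```

The returned text for each page can then be handed to bs4 for parsing.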
1 Answer
Since you are new, there is a combination of things you need to address. You need a good handle on where to look in the HTML, and you need to understand how the site works: What does it put into its URLs, and why? What are the class names of the important parts of the page you will want to reference? How does it handle multi-page display (if it does so at all)? Once you know the website you are scraping well, you can apply that knowledge when you write your automation.
For beginners I'd highly recommend this ebook: https://automatetheboringstuff.com/
It's a great read and easy to follow, even for a beginner in both Python and HTML. Even better, it's free to read on the site!
Chapter 11 is the part you are specifically looking for on web scraping; it will give you the rundown on what to look for and how to go about planning your code.
But I highly recommend you read the whole thing once you are done with your current project.
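As a small illustration of the "what does it put into its URLs" question, here is a sketch of building one URL per results page. Everything specific in it is an assumption: the full `result.php` path is guessed from the project URL, and the parameter names `q` and `page` are placeholders. Check the real ones by submitting the search form and reading the address bar or the form's HTML.

```python
from urllib.parse import urlencode

# Assumption: the search form submits to result.php under the project URL.
BASE_URL = "https://hord.ca/projects/eow/result.php"


def build_search_urls(base_url, query, pages, query_param="q", page_param="page"):
    """Return one URL per results page, numbered from 1.

    `query_param` and `page_param` are hypothetical names; replace them
    with whatever the site actually uses.
    """
    urls = []
    for page in range(1, pages + 1):
        params = {query_param: query, page_param: page}
        urls.append(base_url + "?" + urlencode(params))
    return urls


if __name__ == "__main__":
    for url in build_search_urls(BASE_URL, "example", pages=3):
        print(url)
```

If the site turns out to use a POST form or no paging at all, the same idea applies, but the details will differ.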

-
I appreciate all of the help! I'm actually reading through the chapter right now as we speak! I do have one last question though, if you inspect the source code on the site, the form 'result.php' is outputting information gathered from a database somewhere in the website's directory. Would it be possible to scrape the actual location of the database? – Sloan Bell May 24 '17 at 14:54
-
The short answer is maybe. The medium answer is that you are now stepping into the realm of penetration testing. "Scraping" a database isn't exactly scraping. There is a gray line around what we are allowed to collect from the agreed-upon "common area" of any website. But one thing is common among nearly all admins: they don't want you inside their databases unless they explicitly say so. If you do find a way in, that is usually considered a vulnerability rather than innocent scrapability. Ignoring the admin's wishes could at best get you banned, and at worst prosecuted. – Nalaurien May 24 '17 at 22:32
-
As for whether it's possible to get data from a database through a PHP file: usually no, at least unless they made some horrific security mistake. The way it's **supposed** to work is: PHP gets interpreted on the server and sends you an HTML page with all the info from the **serverside** interpretation. The key to PHP is that it runs on the server and gives you something else; that's why it's useful. So even though it has access to the database, that doesn't mean the page you are looking at has a link to it at all. It could just be the text retrieved from it and nothing more. – Nalaurien May 24 '17 at 22:35
-
That was an extremely helpful explanation, Nalaurien. Thank you for taking some time to respond. I definitely have no intention of pen testing this site. I'll mess around with Selenium a bit and see if I can write a program to automate some data retrieval. Thank you very much for the responses, everyone! – Sloan Bell May 25 '17 at 04:30