
I'm working on a project that basically requires me to go to a website, pick a search mode (name, year, number, etc.), search for a name, select among the results the ones with a specific type (filtering, in other words), choose the option to save those results rather than emailing them, pick a format to save them in, and then download them by clicking the Save button.

My question is: is there a way to do those steps with a Python program? I'm only aware of extracting data and downloading pages/images, but I was wondering whether there's a way to write a script that would manipulate the data and do what a person would do manually, only for a large number of iterations.

I've thought of looking into the URL structure and finding a way to generate the correct URL for each iteration, but even if that works, I'm still stuck because of the "Save" button: I can't find a link that would automatically download the data I want, and using a function from the urllib2 library downloads the page but not the actual file I'm after.

Any idea on how to approach this? Any reference/tutorial would be extremely helpful, thanks!

EDIT: When I inspect the Save button, here is what I get: [screenshot of the inspected Save button element]

Lynn Bou Nassif
  • Do they provide an API? If so, please use that. If not, then your web scraping approach seems fine. I suggest Python's `requests` module. – hridayns Jul 03 '17 at 04:35
  • Use Python `requests` and Beautiful Soup https://www.crummy.com/software/BeautifulSoup/ – mjsqu Jul 03 '17 at 04:43
  • I recommend selenium webdriver – whackamadoodle3000 Jul 03 '17 at 04:50
  • If the Save button is part of a form, then you have to send a GET/POST request with the corresponding parameters – gout Jul 03 '17 at 05:08
  • @code_byter It's the Library of Congress, so they should provide an API if I'm not mistaken. The thing is, I don't have much experience working with APIs and requests, especially if I'm going to have to do more than just retrieve data. I need the script to select things based on type for filtering purposes, and to select an action by clicking on the Save button instead of the Cancel button. Does an API, requests, Beautiful Soup, or Selenium let me manipulate and work with a website that way, or are they limited in that respect? – Lynn Bou Nassif Jul 03 '17 at 06:29
  • @LynnBouNassif The API will allow you to retrieve the data you need without clicking on a save button programmatically. In my experience, trying to web scrape is far more cumbersome than just using an API, especially if the API has good documentation. Now I think I'm still not understanding what exactly you want to save. Can you clarify? If you want to save search results using some sort of filter, there is probably already a way to set filter parameters in the API and get a response in JSON or XML. – hridayns Jul 03 '17 at 06:47
  • @LynnBouNassif I do not know if this helps: https://stackoverflow.com/questions/13667361/how-to-retrieve-books-information-in-xml-json-from-library-of-congress-by-isbn – hridayns Jul 03 '17 at 06:49
  • @code_byter I'll try to be more specific: what I want to do is go to the website and search for a name (so far, this can be done with just a URL), but then I am given a list of results and I want to click on a specific result (which could still be done with the URL as well). After clicking on that result, I am given a few records to select (I can select 1, 2, as many as I want, or all of them). After selecting, I have a few options to click on: save, email, etc. Once I click "Save" I am asked what format (drop-down menu) and I can either confirm by clicking "Save" or cancel. – Lynn Bou Nassif Jul 03 '17 at 06:55
  • @code_byter The problem is that I don't just want to retrieve information; I want the script to do the steps I've written down for me. Using a function from urllib2 doesn't work for me because it saves the contents of the whole web page, not the record specifically. To save the record, there is an option on the web page to save it, and choosing that option leads me to a new page where I have to pick the format and then click Save again. When I click Save, for some reason I can't find any path (or basically any link that would automatically download the record just by entering that link). – Lynn Bou Nassif Jul 03 '17 at 06:59
  • @LynnBouNassif Sounds like you are using a web scraping tool like ParseHub or Kimono. If you are looking to do it this way, then I suggest looking at the guide for the particular tool to see how to automate button clicks. But if you are going to go with the API (assuming it exists), then that would be much easier in my opinion, since I have more experience with APIs than web scraping. With an API, you use the right endpoint for the search, and it returns the results in whatever format you ask for (in the parameters), but it depends a lot on how the target has built their API. – hridayns Jul 03 '17 at 06:59
  • @LynnBouNassif Oh I see, the Save button was probably written in JavaScript, and as such has no link to download from. Have you tried using `Inspect element` to check what link the button refers to in its JavaScript? – hridayns Jul 03 '17 at 07:02
  • @LynnBouNassif Can you please try to find out the content of the records.mrc file? Also, try to find out the javascript files used on the page. There may be some information there. – hridayns Jul 04 '17 at 07:20
  • @code_byter The content consists of records written in MARC format. I'm using "Inspect Element" to try and find the link the button refers to, but weirdly it doesn't look like it refers to anything. Is there another way to look at the JavaScript files? – Lynn Bou Nassif Jul 04 '17 at 08:39
  • @LynnBouNassif What I understand from the Inspect element screenshot is that the `form action` is equal to `records.mrc`, which means your `Save` button is submitting the form to that file. You can see the `type` attribute of the `Save` button. The JavaScript files used on a webpage are usually loaded inside the `head` tag at the top of the page. – hridayns Jul 04 '17 at 15:34
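
To make the form-submission idea from the comments above concrete: if the Save button simply submits a form whose `action` points at `records.mrc`, a script could reproduce that submission with `requests.post` instead of clicking the button. The URL and field names below are placeholders (the real ones would have to be read from the inspected form markup), so treat this as a sketch rather than working code.

import requests

# Placeholder action URL and form fields -- copy the real ones from the
# form markup shown by "Inspect element" on the Save button's page.
FORM_ACTION = "https://catalog.example.org/search/records.mrc"
payload = {
    "format": "MARC",     # hypothetical name of the format drop-down field
    "records": "1,2,3",   # hypothetical field listing the selected records
}

resp = requests.post(FORM_ACTION, data=payload)
resp.raise_for_status()

# Save the returned file to disk instead of relying on the browser's Save dialog.
with open("records.mrc", "wb") as f:
    f.write(resp.content)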

2 Answers


This will depend a lot on the website you're targeting and how its search is implemented.

Some websites, like Reddit, have an open API where you can add a .json extension to a URL and get a JSON response instead of plain HTML.
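
As a rough illustration of the Reddit case (the endpoint and the shape of the response are assumptions based on Reddit's public JSON listings, so check their API docs before relying on it):

import requests

# Appending .json to a Reddit URL returns the same listing as JSON.
# Reddit rejects the default client string, so send a descriptive User-Agent.
resp = requests.get("https://www.reddit.com/search.json",
                    params={"q": "library of congress"},
                    headers={"User-Agent": "my-search-script/0.1"})
resp.raise_for_status()

listing = resp.json()
for post in listing["data"]["children"]:
    print(post["data"]["title"])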

When using a REST API or any JSON response, you can load it into a Python dictionary with the json module like this:

import json

# A JSON response from an API, shown here as a hard-coded string
json_response = '{"customers":[{"name":"carlos", "age":4}, {"name":"jim", "age":5}]}'

# json.loads parses the JSON string into a Python dictionary
rdict = json.loads(json_response)

def print_names(data):
    # iterate over the list stored under the "customers" key
    for entry in data["customers"]:
        print(entry["name"])

print_names(rdict)
Josh Weinstein

You should take a look at the Library of Congress docs for developers. If they have an API, you'll be able to learn how to search and filter through it, which will make everything much easier than driving a browser with something like Selenium. With an API, you could also easily scale your solution up or down.
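
For instance, the loc.gov site has exposed a JSON API where adding fo=json to a search URL returns the results as JSON; the exact endpoint, parameters, and response fields below are assumptions, so verify them against the current Library of Congress developer documentation:

import requests

# Sketch only: confirm the endpoint and the fo=json parameter in the LoC docs.
resp = requests.get("https://www.loc.gov/search/",
                    params={"q": "mark twain", "fo": "json"})
resp.raise_for_status()

# The JSON response (as documented at the time) contains a "results" list.
for item in resp.json().get("results", []):
    print(item.get("title"))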

If there's no API, then you have a few options:

  1. Use Selenium with a browser (I prefer Firefox).

  2. Try to get as much as possible generated, filtered, etc. without actually having to push any buttons on the page, by learning how the site's search engine works with GET and POST requests. For example, if you're looking for books within a date range, run that search manually and watch how the URL changes. If you're lucky, your search criteria appear in the URL. Using this, you can conduct a search just by visiting that URL, which means your program won't have to fill out forms or push buttons, drop-downs, etc. (see the sketch after this list).

  3. If you have to drive the browser through Selenium (for example, if you want to save the whole page with its HTML, CSS, and JS files, you have to press Ctrl+S and then click the "Save" button), then you need to find libraries that let you control the keyboard from Python. Such libraries exist for Ubuntu; they let you press any key and even send key combinations.
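
As a minimal sketch of option 2, assuming a search form that encodes its criteria as plain GET query parameters (the URL and parameter names here are placeholders, not the real Library of Congress ones):

import requests

# Placeholder URL and parameter names: copy the real ones from the address bar
# after running the search manually in a browser.
SEARCH_URL = "https://example.org/catalog/search"
params = {
    "searchType": "name",    # the "search mode" the site offers
    "query": "Twain, Mark",  # the name being searched for
    "yearFrom": 1880,
    "yearTo": 1900,
}

resp = requests.get(SEARCH_URL, params=params)
resp.raise_for_status()

# resp.url shows the fully constructed URL, which helps when reverse
# engineering how the site's search expects its parameters.
print(resp.url)
html = resp.text  # parse this with Beautiful Soup from here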

An example of what's possible:

I wrote a script that logs me in to a website, navigates to a certain page, collects specific links on that page, visits every link, saves every page, avoids saving duplicate pages, and avoids getting caught (i.e. it doesn't behave like a bot by, for example, visiting 100 pages per minute).

The whole thing took 3-4 hours to code, and it actually worked in a virtual Ubuntu machine I had running on my Mac, which means that while it was doing all that work I could still use my machine. If you don't use a virtual machine, then you'll either have to leave the script running and not interfere with it, or write a much more robust program, which IMO is not worth coding since you can just use a virtual machine.
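
And to make option 1 concrete, here is a minimal Selenium sketch with Firefox. The start URL and element names are placeholders (they depend entirely on the target page's markup), and the find_element_by_* calls are the Selenium style current at the time of writing; newer versions use find_element(By.NAME, ...) instead.

from selenium import webdriver

driver = webdriver.Firefox()  # requires geckodriver on your PATH
try:
    driver.get("https://catalog.loc.gov/")  # placeholder start page

    # Placeholder element names/ids: find the real ones with "Inspect element".
    driver.find_element_by_name("query").send_keys("Twain, Mark")
    driver.find_element_by_name("submit").click()

    # ... select records, pick a format from the drop-down, etc. ...

    driver.find_element_by_id("save-button").click()
finally:
    driver.quit()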