I am posting this question after considerable research, and after realising that its title may raise suspicions about my intent in building such a crawler. So let me first describe the scenario I face.
I am a practising advocate in India ('attorney' in some jurisdictions) who enjoys coding and working with Linux. One of the biggest challenges (among a ton of others) for a practising lawyer in India is the amount of menial work the practice involves. Going through daily cause lists (lists published by courts every day showing which cases are listed that day) is one such task.
Courts in India do not have a uniform mechanism for listing cases. Most of them have websites on which the lists are posted, but not in a uniform format: one website publishes them as PDFs, another as HTML pages in tabular form.
A law clerk spends a good part of every day checking these cause lists to find out which cases are up the next day. Sometimes the clerk misses a posting altogether (because of the many anomalies and court registry issues).
This prompted me to write a crawler (first as a Python script, later rewritten in PHP) for a particular tribunal in which my previous office had a lot of cases.
Now, I want to extend it to crawl the lists of other courts. However, I have hit a roadblock: some websites do not publish the lists as plain HTML but as PDF files, and to get to the PDF downloads I have to click buttons and submit forms.
For example, the court I have in mind is the High Court of Kerala. Their cause-list page is served as HTML/PHP, but the relevant tabs and buttons rely on JavaScript (which I am not at all versed in). So I need to emulate clicking and navigating through the buttons and forms (to submit the date).
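From what I have gathered, a browser-automation tool such as Selenium might let me drive a real browser and click through the JavaScript parts without understanding them. Here is a rough sketch of what I have in mind; the URL, field name, and element IDs are placeholders I made up, not the actual values from the Kerala High Court site:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Placeholder URL and locators: the real ones would have to be
# found by inspecting the page with the browser's developer tools.
URL = "https://example.gov.in/causelist"

driver = webdriver.Firefox()
try:
    driver.get(URL)
    # Fill in the date field and submit the form.
    driver.find_element(By.NAME, "listdate").send_keys("15/01/2024")
    driver.find_element(By.ID, "submit_btn").click()
    # Wait for the PDF link to appear, then click it to download.
    pdf_link = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.PARTIAL_LINK_TEXT, "PDF"))
    )
    pdf_link.click()
finally:
    driver.quit()
```

I also suspect the JavaScript may simply issue an HTTP POST under the hood, in which case the browser's network tab would reveal the endpoint and a plain requests.post() might be enough, with no browser automation at all. I am not sure which approach is more robust here.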
Working with the PDF file once obtained is another thing altogether (which I think is doable).
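For instance, with a library such as pdfplumber, extracting the text and searching it for a case number looks straightforward; the file name and case number below are again just placeholders:

```python
import pdfplumber  # third-party library: pip install pdfplumber

SEARCH_TERM = "OP(C) 123/2024"  # hypothetical case number

with pdfplumber.open("causelist.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""
        if SEARCH_TERM in text:
            print(f"Found {SEARCH_TERM} on page {page.page_number}")
```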
Hence, my question.