
I am posting this question after considerable research, and knowing that its title alone could raise suspicions about my intent in building such a crawler.

So, I think I should describe the scenario I face.

I am a practising advocate in India ('attorney' in some countries) who enjoys coding and working with Linux. One of the biggest challenges (among a ton of others) for a practising lawyer in India is the amount of menial work involved in day-to-day practice. Going through daily cause lists (a list each court publishes every day showing which cases are listed before it) is one such task.

Courts in India do not have a uniform mechanism for listing cases. Most of them do have websites on which the cases are posted, but not in a uniform format: one website publishes the lists as PDFs, another as HTML pages in tabular form.

A law clerk spends a good amount of time every day checking these cause lists to find out which cases are up the next day. Sometimes the clerk may miss a posting altogether (because of the many anomalies and court-registry issues).

This prompted me to write a crawler in Python for a particular tribunal before which my previous office had a lot of cases. I later rewrote it in PHP.

Now I want to extend it to crawl the lists of other courts. However, I have hit a roadblock: some websites do not publish the lists as plain HTML but as PDF files, and to download those PDFs I have to click buttons and submit forms.

For example, the court I have in question is the High Court of Kerala. Their cause-list page is written in HTML/PHP, but the relevant tabs and buttons rely on JavaScript (which I am not at all versed in). So I need to emulate clicking buttons and navigating through forms (to submit the date).

Working with the PDF file obtained is another thing altogether (and one I think is feasible).

Hence, my question.

Basil Ajith
  • I've used Puppeteer in Node, and there is a PHP bridge for it here -> https://github.com/rialto-php/puphpeteer – Keith Mar 17 '21 at 10:29
  • @Keith puphpeteer looks a bit above my level. Still, I will give it a try. Thank you. – Basil Ajith Mar 17 '21 at 11:34
  • What **exactly** have you tried so far? Where **exactly** are you stuck? PHP alone cannot click buttons that rely on Javascript – Nico Haase Mar 27 '21 at 10:17
  • Have you looked into using Mink, a PHP library for controlling browsers and traversing websites? It can be used with Chrome (in headless mode too). Some documentation at https://mink.behat.org/en/latest/drivers/chrome.html – mickadoo Mar 27 '21 at 15:37
  • The most straightforward way might be to consider automation tools like [axiom.ai](https://axiom.ai/) (comes with a price tag attached) or good old [Selenium](https://www.selenium.dev/) ([open source](https://github.com/SeleniumHQ/selenium)), especially [Selenium IDE](https://www.selenium.dev/selenium-ide/) - though Selenium has kind of a learning curve, it will spare you the need to re-invent the wheel. – nosurs Apr 02 '21 at 22:18

2 Answers


First of all, let's clarify: a web crawler that visits websites and performs operations on them needs to

  • be able to send requests
  • parse HTML
  • execute Javascript

You will need some kind of browser, either one that is widely available or a browser engine. For PHP-based browser engines, read through the questions and articles here to find the one that suits you best: PHP Headless Browser?
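
To give an idea of what this looks like in practice, here is a minimal, hypothetical sketch using symfony/panther, a PHP library that drives a real (headless) Chrome. The form name, field name and selectors below are placeholders, not the actual markup of the Kerala High Court site:

<?php
// Hypothetical sketch; requires "composer require symfony/panther".
// Form/field/selector names are placeholders, not the real site's markup.
require __DIR__.'/vendor/autoload.php';

use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->request('GET', 'https://hckerala.gov.in/causelist.php');

// Fill the date field and press the search button, as a user would
$client->submitForm('search', ['list_date' => '07-04-2021']);

// Wait until the page's own Javascript has rendered the result table
$client->waitFor('table.causelist');

// Extract the rows with the DomCrawler API
$client->getCrawler()->filter('table.causelist tr')->each(function ($row) {
    echo $row->text(), PHP_EOL;
});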

Years ago, my former professor and I wrote an article about semantic extraction. Our idea was to translate the HTML code into a semantic hierarchy by mapping HTML rules onto concepts that can be embedded into one another. We implemented a proof of concept in Javascript, which I have used in many PHP-related extraction tasks over the years, often solving large problems in hours.

A very simplistic practical approach is to implement server-side code (which I mainly did in PHP) where the page you load sends a request to the target website and transmits the response back to your browser. With this kind of proxying you establish a connection to the website you intend to mine in such a way that its HTML is sent back to your browser.

Problems that you will face:

  • relative URLs for CSS files
  • relative URLs for Javascript files
  • requests being sent via Javascript to relative URLs
  • relative URLs for other resources, e.g. images

You will need to fix all these URL discrepancies so that the page you load is properly functional.

When the page is functional via your proxy, you can implement your own Javascript that is to be executed after the page has loaded, and serve that Javascript through your proxy as well. This ensures that your script (whatever it may be) performs the actions you need.
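
A minimal sketch of such a proxy in PHP might look like this. The target URL and the injected script name are illustrative; a <base> tag resolves most of the relative-URL problems listed above, though requests issued by the page's own Javascript may additionally hit same-origin restrictions and need to be routed through the proxy too:

<?php
// Simplistic proxy sketch (illustrative URL and file names)
$target = 'https://hckerala.gov.in/causelist.php';
$html   = file_get_contents($target);

// Make relative CSS/Javascript/image URLs resolve against the original site
$html = preg_replace(
    '/<head([^>]*)>/i',
    '<head$1><base href="https://hckerala.gov.in/">',
    $html,
    1
);

// Inject our own automation script so it runs once the page has loaded
$html = str_replace('</body>', '<script src="my-automation.js"></script></body>', $html);

header('Content-Type: text/html; charset=utf-8');
echo $html;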

Now, what should that Javascript look like? Basically, you can experiment on the original site using the browser's Dev Tools.

For example: you know that you need to log in, which involves filling in the username and password and clicking a button. You may need a script that looks like this, though the details differ from site to site:

document.getElementById("username").value = "myusername";
document.getElementById("password").value = "mypassword";
document.getElementById("login").click();

So you create a script with content similar to the above and ensure that it runs once the page has loaded. Naturally, you will probably also need to implement navigation and the actual extraction. Since in your case you need a file, the download will simply land in your Downloads folder; downloading is therefore out of scope for this answer, and I will focus on the extraction. You may also need to work repeatedly through paging or an infinite scroll, in which case you can trigger the corresponding events in Javascript.

If, for some reason, this simplistic approach is not an option for you, the idea should still work with whichever browser engine you choose to drive from PHP, the downside being that you are often unable to test visually.
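
For example, with a browser engine such as Panther (used in the sketch earlier), triggering the paging events mentioned above can be done from PHP. This is a hypothetical loop; the ".next-page" selector is an assumption:

<?php
// Hypothetical paging loop; assumes a $client created as in the earlier
// Panther sketch and a ".next-page" button on the target site (placeholder).
for ($page = 1; $page <= 10; $page++) {
    // Collect the rows of the current page first
    $rows = $client->getCrawler()->filter('table tr')
        ->each(fn ($row) => $row->text());
    print_r($rows);

    // Trigger the site's own Javascript paging handler
    $client->executeScript('document.querySelector(".next-page").click();');
    $client->waitFor('table tr');
}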

You will not necessarily need a professional system such as the one we propose in the paper, but if you do, you might want to read our paper and other similar ones. To sum up the idea behind the semantic tree: you define abstract navigation and/or extraction rules for your concepts, and these rules work across several sites sharing the same concept; a conceptual change is less likely and less frequent than a structural change in a website. So you define general rules that can be overridden in special cases. For example, if you crawl 100 different websites, all of which have some kind of table with paging and useful records, each record having a detail link, then the concept can be formulated generally, and the problem at the site-to-site level is reduced to "how do I apply my general concept to this particular structure?", which often takes just a few lines of Javascript per site, while the inner logic of what you do with the extracted data remains much the same across sites.
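
To make that concrete, here is a hypothetical configuration-driven extractor in PHP; the site names and selectors are invented, and the crawler object follows the symfony/dom-crawler API (with symfony/css-selector for the CSS expressions):

<?php
use Symfony\Component\DomCrawler\Crawler;

// General rules for the "paged table of cases" concept...
$defaults = [
    'row'     => 'table tr',
    'caseNo'  => 'td:nth-child(1)',
    'parties' => 'td:nth-child(2)',
];

// ...overridden only where a site's structure differs
$sites = [
    'court-a.example' => [],                                // defaults as-is
    'court-b.example' => ['row' => '#causelist .case-row'], // structural override
];

function extractCases(string $site, Crawler $crawler, array $defaults, array $sites): array
{
    $rules = array_merge($defaults, $sites[$site] ?? []);

    return $crawler->filter($rules['row'])->each(fn (Crawler $row) => [
        'caseNo'  => $row->filter($rules['caseNo'])->text(''),
        'parties' => $row->filter($rules['parties'])->text(''),
    ]);
}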

Lajos Arpad

Sometimes you can just look at the network activity in the browser's developer tools and see what requests the page is making.

For example, I see a request for hckerala.gov.in/causelist.php.

You can make this request in your web crawler or in the terminal, for example, with cURL:

curl 'https://hckerala.gov.in/causelist.php' \
  -H 'User-Agent: Mozilla/5.0 (Linux)' \
  -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' \
  -H 'X-Requested-With: XMLHttpRequest' \
  -H 'Origin: https://hckerala.gov.in' \
  -H 'Referer: https://hckerala.gov.in/causelist.php' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  --data-raw 'type=fetchlistbyjudge&judge=159&list_date=07-04-2021'

And here's what data I get back:

{"VC (HONOURABLE MRS. JUSTICE ANU SIVARAMAN) Chamber list 2":{"cases":{"KLHC010025852021":{"cino":"KLHC010025852021","case_details":"WP(C) 935\/2021","pet_name":"V.S. HARIKUMAR","res_name":"THE SECRETARY TO GOVERNMENT","pet_adv":["S.V.PREMAKUMARAN NAIR","SRI.R.T.PRADEEP","SMT.M.BINDUDAS","SRI.K.C.HARISH"],"res_adv":["SRI.JIBU P THOMAS","SRI.V.A.MUHAMMED","SRI.M.SAJJAD","GOVERNMENT PLEADER- SERVED ON"],"regcase_type":157,"reg_no":935,"reg_year":2021,"room_no":"VC                  ","cheader":" BY VIDEO CONFERENCING\r\nNOTE: 1. PARTICIPANTS WHO ARE ATTENDING THE VIDEO CONFERENCING\r\nSHOULD JOIN BY 10.00 AM\r\n2. THE ADVOCATES SHOULD FOLLOW THE PRESCRIBED DRESS CODE,\r\nWITH OR WITHOUT ROBES AND GOWN","cfooter":"","for_bench_id":4274,"originalsr_no":0,"main_sr_no":138,"clink_code":"215700019342021","main_case":
Steven Almeroth