I am trying to extract patterns from online PDFs using a client-side script (Tampermonkey / Greasemonkey, in Firefox or Chrome). The implementation can be browser-specific; I would like to get it working in either one.
I am able to use JS to extract the content and match on it manually in Firefox (which loads pdf.js automatically). E.g. on a PDF URL:
var matchList = document.body.innerText.match(/my_regex/gi);
I am now trying to port this into Greasemonkey for a user-script:
// ==UserScript==
// @name MyExtractor
// @version 1
// @grant none
// @include *.pdf
// ==/UserScript==
console.log("User script");
console.log(document.body.innerText); // run manually in the console, this logs the PDF as text
alert("HI");
The script doesn't load - is it possible to get a GM script to execute on a PDF URL in Firefox?
In Chrome, the PDF document seems to be embedded, so even with JS run directly in the console I can't seem to get access to the content, e.g.:
> document.getElementsByTagName("embed")[0]
<embed name="some_id" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="some_id">
This is about as far as I have been able to get with Chrome - is there a way to get the PDF object based on the above element and extract text from it?
As for the JS, I do not necessarily need it to run directly on the PDF URL. I could also have it identify a page that contains an anchor whose href points to a PDF, then fetch and parse that PDF via a request, if that is possible. Is there a way to fetch the file and process it with a PDF library?
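A minimal sketch of the fetch-and-parse route I have in mind, under some assumptions: `pdfjsLib` is pdf.js made available to the script (e.g. through an `@require` line, with its worker configured), and the PDF bytes have already been fetched as an ArrayBuffer (e.g. with `GM_xmlhttpRequest` and `responseType: "arraybuffer"`). The helper names (`findPdfLinks`, `pageItemsToText`, `extractPdfText`, `findMatches`) are my own and hypothetical; only `getDocument`, `numPages`, `getPage` and `getTextContent` come from pdf.js.

```javascript
// Sketch only: `pdfjsLib` is assumed to be pdf.js, loaded by the userscript.

// Keep only the hrefs that look like PDF links (query string or fragment allowed).
function findPdfLinks(hrefs) {
  return hrefs.filter((href) => /\.pdf(?:[?#].*)?$/i.test(href));
}

// Join the text items of one pdf.js page into a single string.
function pageItemsToText(items) {
  return items.map((item) => item.str ?? "").join(" ");
}

// Extract the text of every page, one line per page.
// `pdfjsLib` is passed in as a parameter; `data` holds the raw PDF bytes
// (ArrayBuffer or Uint8Array), fetched elsewhere.
async function extractPdfText(pdfjsLib, data) {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    pages.push(pageItemsToText(content.items));
  }
  return pages.join("\n");
}

// Same matching as the manual console test; returns [] instead of null.
function findMatches(text, regex) {
  return text.match(regex) || [];
}
```

On a page with PDF anchors, the idea would be to pass `Array.from(document.links, (a) => a.href)` to `findPdfLinks`, fetch each result, and then run `findMatches(await extractPdfText(pdfjsLib, bytes), /my_regex/gi)`. Whether the library can be loaded this way inside GM is exactly the part I have not managed to get working.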
References used so far:
- Execute a Greasemonkey script on every page, regardless of page-type (like foo.com/image.jpg)? - do I need to build an extension for this?
- Extract text from pdf file using javascript (and followed some of the links). Specifically, I have tried to follow How to extract text from PDF in JavaScript, but have not been able to create a reference to the PDF source or add the library to GM and get it to execute as expected. Is this a good path to follow to solve the problems I am running into?