Client-side javascript to extract patterns from online PDF document

Question

I am trying to extract patterns from online PDFs using a client-side script (Tampermonkey / Greasemonkey, in Firefox or Chrome). The implementation can be browser-specific; I would like to get it working in either one.

I am able to use JS to extract the content and match on it manually in Firefox (which loads pdf.js automatically). E.g. on a PDF URL:

var matchList = document.body.innerText.match(/my_regex/gi);

I am now trying to port this into Greasemonkey for a user-script:

// ==UserScript==
// @name     MyExtractor
// @version  1
// @grant    none
// @include  *.pdf
// ==/UserScript==

console.log("User script");
console.log(document.body.innerText); // executed manually in the console, this logs the PDF text
alert("HI");

The script doesn't load. Is it possible to get a GM script to execute on a PDF URL in Firefox?

In Chrome, the PDF document seems to be embedded, so even with JS run directly in the console I can't get access to the content. E.g.:

> document.getElementsByTagName("embed")[0]
<embed name="some_id" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="some_id">

This is about as far as I have been able to get with Chrome - is there a way to get the PDF object based on the above element and extract text from it?

The JS does not necessarily need to run directly on the PDF URL. It could instead identify a page that has an anchor whose href points to a PDF, then fetch and parse that PDF via a request. Is there a way to fetch and process it with a PDF library?
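
To make the fetch-and-parse idea concrete, here is a rough sketch of what I have in mind, assuming pdf.js can be loaded into the user script somehow (which is the part I have not solved). `findPdfLinks` and `extractMatches` are hypothetical helper names of my own, and `pdfjsLib` is passed in as a parameter so the sketch does not depend on how the library gets loaded. The pattern is assumed to carry the `g` flag.

```javascript
// Hypothetical helper: keep only the hrefs that point at a .pdf file,
// allowing an optional query string or fragment after the extension.
function findPdfLinks(hrefs) {
  return hrefs.filter((href) => /\.pdf(?:[?#].*)?$/i.test(href));
}

// Hypothetical helper: load a PDF through pdf.js and collect every regex match.
// Assumes `pattern` has the global flag, so String.prototype.match returns
// all matches instead of a single match plus capture groups.
async function extractMatches(pdfjsLib, url, pattern) {
  const pdf = await pdfjsLib.getDocument(url).promise;
  const matches = [];
  for (let pageNo = 1; pageNo <= pdf.numPages; pageNo++) {
    const page = await pdf.getPage(pageNo);
    const content = await page.getTextContent();
    // Each text item exposes its string as `str`; join them into one page text.
    const text = content.items.map((item) => item.str).join(" ");
    matches.push(...(text.match(pattern) || []));
  }
  return matches;
}
```

On a normal HTML page this might be driven with `findPdfLinks(Array.from(document.links, (a) => a.href))` and then one `extractMatches` call per link. I assume cross-origin PDFs would hit CORS restrictions with a plain request, and that pdf.js would also need its worker script configured; neither is handled here.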

References used so far:

Execute a Greasemonkey script on every page, regardless of page-type (like foo.com/image.jpg)? - do I need to build an extension for this?
Extract text from pdf file using javascript (and followed some of the links) - specifically, I have tried to follow this: How to extract text from PDF in JavaScript - but I have not been able to create a reference to the PDF source or add the library to GM and have it execute as expected. Is this a good path to follow to solve the problems I am running into?

This is the address I used for the bookmarklet (script): `javascript:alert(document.body.innerText.match(/(my_regex)/gi));` - it works in Firefox when on a PDF URL directly. — Brett, Jun 29 '22 at 13:10
