
I am trying to extract patterns from online PDFs using a client-side script (Tampermonkey / Greasemonkey, in Firefox or Chrome). The implementation can be browser-specific; I would like to get it working in either one.

I am able to use JS to extract the content and match on it manually in Firefox (which loads pdf.js automatically). E.g. on a PDF URL:

var matchList = document.body.innerText.match(/my_regex/gi);

I am now trying to port this into Greasemonkey for a user-script:

// ==UserScript==
// @name     MyExtractor
// @version  1
// @grant    none
// @include  *.pdf
// ==/UserScript==

console.log("User script");
console.log(document.body.innerText); // this JS executed manually logs the PDF to text, but 
alert("HI");

The script doesn't load. Is it possible to get a Greasemonkey script to execute on a PDF URL in Firefox?

In Chrome, the PDF document seems to be embedded, so even with JS run directly in the console I can't seem to get access to the content, e.g.:

> document.getElementsByTagName("embed")[0]
<embed name="some_id" style="position:absolute; left: 0; top: 0;" width="100%" height="100%" src="about:blank" type="application/pdf" internalid="some_id">

This is about as far as I have been able to get with Chrome - is there a way to get the PDF object based on the above element and extract text from it?

As for the JS, I do not necessarily need it to run directly on the PDF URL. I can also have it identify a page that contains an anchor whose href points to a PDF, and then fetch and parse that PDF via a request, if possible. Is there a way to fetch a PDF and process it with a PDF library somehow?
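To make the fetch-and-parse idea concrete, here is a rough sketch of what I have in mind, assuming pdf.js can be made available to the script as `pdfjsLib` (for example via `@require`). The pdf.js calls used are `getDocument`, `getPage` and `getTextContent`; the function names `extractPdfText`, `findMatches` and `matchesInPdfUrl` are my own hypothetical helpers, not part of any library. I have not confirmed this works inside a userscript: pdf.js would probably also need `GlobalWorkerOptions.workerSrc` set, and a plain `fetch` is subject to CORS, so cross-origin PDFs may need the userscript manager's own request API instead.

```javascript
// Sketch only: pull regex matches out of a PDF's text layer with pdf.js.
// `pdfjsLib` is passed in as a parameter (assumption: it was loaded
// elsewhere, e.g. through @require in the userscript header).

// Join the text of every page of a PDF supplied as an ArrayBuffer/Uint8Array.
async function extractPdfText(pdfjsLib, data) {
  const pdf = await pdfjsLib.getDocument({ data }).promise;
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) { // pdf.js pages are 1-indexed
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    // Each item is one text run; joining with spaces is a simplification
    // and will not reproduce the original layout exactly.
    pages.push(content.items.map((item) => item.str || "").join(" "));
  }
  return pages.join("\n");
}

// Return every match of `pattern` (a RegExp) in `text`, or [] if none.
function findMatches(text, pattern) {
  // Force the global flag so String.prototype.match returns all matches.
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  return text.match(new RegExp(pattern.source, flags)) || [];
}

// Hypothetical wiring: fetch a linked PDF and run the pattern over its text.
// `fetchFn` defaults to the page's fetch, which is subject to CORS.
async function matchesInPdfUrl(pdfjsLib, url, pattern, fetchFn = fetch) {
  const response = await fetchFn(url);
  const data = await response.arrayBuffer();
  return findMatches(await extractPdfText(pdfjsLib, data), pattern);
}
```

On a page with PDF links, the idea would be to collect the hrefs ending in `.pdf` and call `matchesInPdfUrl(pdfjsLib, href, /my_regex/gi)` for each one.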


Brett
  • I am able to use a match extractor in bookmarklet. – Brett Jun 28 '22 at 22:20
  • This is the address I used for the bookmarklet (script): `javascript:alert(document.body.innerText.match(/(my_regex)/gi));` - it works in Firefox when on a PDF URL directly. – Brett Jun 29 '22 at 13:10
