
I'm building a client-side app that uses PDF.js to parse the contents of a selected PDF file, and I'm running into a strange issue.

Everything seems to be working great. The code successfully loads the document into a PDF.js PDF object, loops through the Pages of the document, and gets the textContent for each Page.

After running the code below and inspecting the data in the browser dev tools, I'm noticing that each Page's textContent object contains the text of the entire document, not ONLY the text from that Page.

Has anybody experienced this before?

I pulled (and modified) most of the code I'm using from PDF.js posts here, and it's pretty straightforward and seems to perform exactly as expected, aside from this issue:

testLoop: function (event) {
    var file = event.target.files[0];
    var fileReader = new FileReader();
    fileReader.readAsArrayBuffer(file);
    fileReader.onload = function () {
        // Wrap the raw ArrayBuffer so PDF.js can read it
        var typedArray = new Uint8Array(this.result);
        PDFJS.getDocument(typedArray).then(function (pdf) {
            // Page numbers are 1-based in PDF.js
            for (var i = 1; i <= pdf.numPages; i++) {
                pdf.getPage(i).then(function (page) {
                    page.getTextContent().then(function (textContent) {
                        console.log(textContent);
                    });
                });
            }
        });
    };
},

Additionally, the sizes of the returned textContent objects are slightly different for each Page, even though all of the objects share a common final item - the last bit of text for the whole document.

Here is an image of my inspector to illustrate that the objects are all very similarly sized.

Through manual inspection of the objects in the inspector shown, I can see that the data from Page #1, for example, should really only consist of ~140 array items, so why does the object for that page contain ~700 or so? And why the variation?

[screenshot of the browser inspector]

ineedhelp
  • What are the items? Without seeing the array or the pdf, it's hard to be sure, but I've run across documents where the items aren't simple word tokens. For (good, bad, random, irrelevant) reasons, the pdf could be built in such a way that it splits many times on whitespace or mid-word. I've always joined the arrays and performed my own tokenizing. A simple `.join().split(/\s+/)` might be worth a try. – user01 Jul 09 '16 at 02:17
  • This answer has a very good example on how to use PDF.js to extract text contents: http://stackoverflow.com/a/20522307/6481438 – GCSDC Jul 09 '16 at 02:50
  • @user01 Thanks for the reply - your point about the PDF being composed strangely helped a lot! – ineedhelp Jul 09 '16 at 16:02
  • @GCSDC Yeah, that's actually one of the posts here I used to create my script - it's a good one. Thanks for posting the link, for posterity :) – ineedhelp Jul 09 '16 at 16:03

1 Answer


It looks like the issue here is the formatting of the PDF document I'm trying to parse. The PDF contains government records in a tabular format, which apparently was not composed according to modern PDF standards.

I've tested the script with different PDF files (which I know are properly composed), and the Page textContent objects returned are correctly split based on the content of the Pages.

In case anyone else runs into this issue in the future, there are at least two ways I can think of to handle the problem:

  1. Somehow reformat the malformed PDF to use updated standards, then process it. I don't know how to do this, nor am I sure it's realistic.

  2. Select the largest of the returned Page textContent objects (since they all contain more or less the full text of the document) and do your operations on that one - see the sketch below.
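
For reference, here's a rough sketch of approach #2, using the same old PDFJS promise API as in my question. The function name extractFullText is just illustrative, and the join/re-tokenize step at the end follows user01's comment above:

function extractFullText(typedArray) {
    return PDFJS.getDocument(typedArray).then(function (pdf) {
        // Kick off getTextContent for every Page up front
        var pagePromises = [];
        for (var i = 1; i <= pdf.numPages; i++) {
            pagePromises.push(pdf.getPage(i).then(function (page) {
                return page.getTextContent();
            }));
        }
        return Promise.all(pagePromises).then(function (contents) {
            // Keep the textContent with the most items - for a
            // malformed document like mine, that's the most
            // complete copy of the full text
            var largest = contents.reduce(function (best, tc) {
                return tc.items.length > best.items.length ? tc : best;
            });
            // Join the item strings and re-tokenize on whitespace
            return largest.items.map(function (item) {
                return item.str;
            }).join(' ').split(/\s+/);
        });
    });
}

Calling extractFullText(typedArray) from the FileReader onload handler (in place of the loop in my question) should resolve with a flat array of word tokens for the whole document.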

ineedhelp