3

I have been using an adaptation of code from these posts:

PDF to Text extractor in nodejs without OS dependencies

pdfjs: get raw text from pdf with correct newline/withespace

to convert pdfs to text:

import pdfjsLib from 'pdfjs-dist/legacy/build/pdf.js';

import {
    TextItem,
    DocumentInitParameters,
} from 'pdfjs-dist/types/src/display/api';

const getPageText = async (pdf: pdfjsLib.PDFDocumentProxy, pageNo: number) => {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const textItems = tokenizedText.items as TextItem[];
    let finalString = '';
    let line = 0;

    // Append each item's text, starting a new line whenever the
    // y-coordinate of the item (transform[5]) changes.
    for (const item of textItems) {
        const y = item.transform[5];
        if (line !== y) {
            if (line !== 0) {
                finalString += '\r\n';
            }
            line = y;
        }
        finalString += item.str;
    }
    return finalString;
};

export const getPDFText = async (
    data: string,
    password: string | undefined = undefined
) => {
    const initParams: DocumentInitParameters = {
        data: Buffer.from(data, 'base64'),
        //useSystemFonts: true,
        //disableFontFace: false,
        standardFontDataUrl: 'standard_fonts/'
    };

    if (password !== undefined) {
        initParams.password = password;
    }

    const pdf = await pdfjsLib.getDocument(initParams).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
        pageTextPromises.push(getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    const joined = pageTexts.join(' ');
    return joined;
};

With version 3.1.81 of pdfjs-dist the output looks pretty good, but the checks in form checkboxes are lost, and text field values show up at the end of each page instead of staying in context. I suspect that https://pdftotext.com/ uses pdfjs, based on similarities with my output, but it keeps the checks in the boxes, and its text field "answers" appear next to the questions.

Run with:

import { join } from 'path';
import { readFileSync } from 'fs';

const rawContents = readFileSync(join('directory', 'file.pdf'), 'base64');

const pdfText = await getPDFText(rawContents);

Anyone have an idea why I am losing the checks (the boxes are there)?

Sample of what I get:

22. when something something?
☐ 0-3 months ago
☐ 4-6 months ago
☐ 7-12 months ago
☐ 13-18 months ago
☐ 19-24 months ago
☐ 25-60 months ago
☐ I don't know

here is what that webpage gets:

22. when something something?

✔ 0-3 months ago
☐
☐ 4-6 months ago

☐ 7-12 months ago

☐ 13-18 months ago

☐ 19-24 months ago

☐ 25-60 months ago

☐ I don’t know

Again, my output looks like theirs but has lost these checks. I don't know for sure that they use pdfjs, but I think they do.

Note that I have downloaded and put a couple of fonts in the standard_fonts directory. Should I copy them all even if I see no warning message?

chrismead
  • Yeah, fair enough. I was just comparing against that site because its output is similar to mine, and I think pdf.js should support checkboxes – chrismead Dec 12 '22 at 21:47

2 Answers

2

In forms, check boxes are field boundaries, not part of any nearby text. This is true of all fields: they are not directly connected to their descriptions; they simply have a name and a value. Here Check Box1 and Box2 have been placed, and Box3 is still awaiting its surface appearance.

Note especially that they have no fixed appearance: they morph when displayed. They are chimeras that only look as if they are present on the page.

In these AcroForm cases they have no native plain-text equivalent. There is nothing to detect; the index simply points to page coordinates.
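Since a field is just a value attached to page coordinates, one workaround is to match each field's rectangle to the nearest extracted line yourself. Below is a minimal sketch of that idea. The `TextLine` and `CheckboxWidget` shapes and the `mergeCheckboxes` helper are hypothetical, made up for illustration. I am assuming the line positions would come from `transform[5]` of the pdf.js text items and the widget rectangle and checked state from the annotation data returned by `page.getAnnotations()`, so check what your pdfjs-dist version actually returns before relying on it.

```typescript
// Hypothetical, simplified shapes (assumptions, not pdf.js types).
interface TextLine {
    y: number; // y-coordinate of the line, e.g. transform[5] of a text item
    text: string;
}

interface CheckboxWidget {
    rect: [number, number, number, number]; // [x1, y1, x2, y2] in page space
    checked: boolean;
}

// Mark a line as checked when a checked widget's vertical extent
// (plus a small tolerance) contains the line's y-coordinate.
function mergeCheckboxes(
    lines: TextLine[],
    widgets: CheckboxWidget[],
    tolerance = 2
): string {
    return lines
        .map((line) => {
            const widget = widgets.find(
                (w) =>
                    line.y >= Math.min(w.rect[1], w.rect[3]) - tolerance &&
                    line.y <= Math.max(w.rect[1], w.rect[3]) + tolerance
            );
            if (!widget || !widget.checked) {
                return line.text;
            }
            // Swap a leading empty box for a checked one, or prefix a checked box.
            return line.text.charAt(0) === '☐'
                ? '☑' + line.text.slice(1)
                : '☑ ' + line.text;
        })
        .join('\r\n');
}
```

This only matches on the vertical position, so two boxes on the same line would need the x-coordinates compared as well.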

PDF.js is a PDF-to-HTML converter, so it can easily display those indexed areas as HTML fields.
Note that here the mark is an X.

In terms of the PDF's extractable surface there is no text. For the boxes above and below, there is only the description that appears alongside those radio boxes.

Note that here the mark is a tick; nothing differs except the viewer.

If we try to extract text using PDF.js (here in the browser), we get just the text.

In some cases, where the native Symbol or ZapfDingbats fonts (or another TTF with those code points) have been embedded and adapted to reflect the field state, it may be possible to get a font-based checkmark symbol. That is rare unless the form was designed that way.

☐ as you see in your case then to replace with one
☑ is picking the correct one from font and add as
☒ replacement its not very easy but doable.

The symbols above, printed from HTML to PDF, can then be extracted again, as here, using plain pdftotext or Python:

☐ as you see in your case then to replace with one
☑ is picking the correct one from font and add as
☒ replacement its not very easy but doable.
K J
  • well, not the answer i was hoping for, but very thorough. just gonna wait a little before marking it as accepted just in case but looks like you know what you are talking about – chrismead Dec 13 '22 at 16:52
  • if these pdfs are from my company, do you think our devs could embed symbol fonts? – chrismead Dec 13 '22 at 16:53
1

For anyone else out there looking:

https://formulae.brew.sh/formula/poppler

This includes the pdftotext command, which does pick up the checkmarks.

EDIT: After digging in further, I definitely like pdftotext from poppler. It does have one oddity: a line that wraps on a dash will be unwrapped minus the dash. I think it's trying to be smart and assumes the dash is there to indicate a wrap. Pretty much an edge case, but worth noting.
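To make the oddity concrete, here is a tiny sketch of the kind of heuristic I am guessing at. It is purely illustrative: `unwrapHyphenatedLines` is a made-up helper, not poppler's actual code, and I have not checked how pdftotext really decides this.

```typescript
// Illustrative only: join a line that ends in "-" with the following line
// and drop the dash, mimicking the unwrapping described above.
function unwrapHyphenatedLines(text: string): string {
    const out: string[] = [];
    for (const line of text.split(/\r?\n/)) {
        const prev = out.length > 0 ? out[out.length - 1] : undefined;
        if (prev !== undefined && prev.charAt(prev.length - 1) === '-') {
            // Previous line ended on a dash: remove it and append this line.
            out[out.length - 1] = prev.slice(0, -1) + line;
        } else {
            out.push(line);
        }
    }
    return out.join('\n');
}
```

With a rule like this, a compound such as "long-term" that happens to wrap on its dash comes out as "longterm", which is the kind of result I mean.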

There is also a node wrapper which saves you from having to deal with temporary files: https://www.npmjs.com/package/node-pdftotext
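If you would rather not add a dependency, calling the binary directly is also an option. This is a rough sketch, assuming `pdftotext` is installed and on the PATH; `buildPdftotextArgs` and `pdfToTextViaPoppler` are names I made up. Passing `-` as the output file makes pdftotext write to stdout, so no temporary output file is needed, though the input here is still a file on disk.

```typescript
import { execFile } from 'child_process';
import { promisify } from 'util';

const execFileAsync = promisify(execFile);

// Build the argument list: optional -layout, the input PDF, and "-" so the
// text goes to stdout instead of a file.
function buildPdftotextArgs(pdfPath: string, keepLayout = true): string[] {
    const args: string[] = keepLayout ? ['-layout'] : [];
    args.push(pdfPath, '-');
    return args;
}

// Run pdftotext (assumed to be on the PATH) and resolve with the extracted text.
async function pdfToTextViaPoppler(pdfPath: string): Promise<string> {
    const { stdout } = await execFileAsync(
        'pdftotext',
        buildPdftotextArgs(pdfPath),
        { maxBuffer: 64 * 1024 * 1024 } // allow large documents
    );
    return stdout;
}
```

Usage would be something like `const text = await pdfToTextViaPoppler(join('directory', 'file.pdf'));`. I pass `-layout` by default because it keeps the physical layout, but whether that helps depends on the form.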

chrismead
  • I will add some details once I dig in. So far it definitely doesn't look perfect, but may be useable. The check is present next to the appropriate text but the box it would go in is also shown on the next line. – chrismead Jan 11 '23 at 15:39