I have been using an adaptation of code from these posts:
PDF to Text extractor in nodejs without OS dependencies
pdfjs: get raw text from pdf with correct newline/withespace
to convert PDFs to text:
import pdfjsLib from 'pdfjs-dist/legacy/build/pdf.js';
import {
  TextItem,
  DocumentInitParameters,
} from 'pdfjs-dist/types/src/display/api';

// Extract the text of one page, starting a new line whenever the y position changes.
const getPageText = async (pdf: pdfjsLib.PDFDocumentProxy, pageNo: number) => {
  const page = await pdf.getPage(pageNo);
  const tokenizedText = await page.getTextContent();
  const textItems = tokenizedText.items as TextItem[];
  let finalString = '';
  let line = 0;
  // Concatenate the string of each item to the final string
  for (const item of textItems) {
    // transform[5] is the item's y position on the page
    const y = item.transform[5];
    if (line !== y) {
      if (line !== 0) {
        finalString += '\r\n';
      }
      line = y;
    }
    finalString += item.str;
  }
  return finalString;
};
export const getPDFText = async (
  data: string,
  password: string | undefined = undefined
) => {
  const initParams: DocumentInitParameters = {
    data: Buffer.from(data, 'base64'),
    //useSystemFonts: true,
    //disableFontFace: false,
    standardFontDataUrl: 'standard_fonts/'
  };
  if (password !== undefined) {
    initParams.password = password;
  }
  const pdf = await pdfjsLib.getDocument(initParams).promise;
  const maxPages = pdf.numPages;
  const pageTextPromises = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
    pageTextPromises.push(getPageText(pdf, pageNo));
  }
  const pageTexts = await Promise.all(pageTextPromises);
  const joined = pageTexts.join(' ');
  return joined;
};
With version 3.1.81 of pdfjs-dist the output looks pretty good, but the check marks in form checkboxes are lost, and text field values show up at the end of each page instead of staying in context. I suspect that this page: https://pdftotext.com/ uses pdfjs, based on similarities with my output, but they keep the checks in the boxes, and their text field "answers" appear next to the questions.
Run with:
import { join } from 'path';
import { readFileSync } from 'fs';
const rawContents = readFileSync(join('directory', 'file.pdf'), 'base64');
const pdfText = await getPDFText(rawContents as string);
Does anyone have an idea why I am losing the check marks (the boxes themselves are there)?
Sample of what I get:
22. when something something?
☐ 0-3 months ago
☐ 4-6 months ago
☐ 7-12 months ago
☐ 13-18 months ago
☐ 19-24 months ago
☐ 25-60 months ago
☐ I don't know
Here is what that webpage gets:
22. when something something?
✔ 0-3 months ago
☐
☐ 4-6 months ago
☐ 7-12 months ago
☐ 13-18 months ago
☐ 19-24 months ago
☐ 25-60 months ago
☐ I don’t know
Again, my output looks like theirs but has lost these checks. I don't know for sure that they use pdfjs, but I think they do.
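A possible direction, offered as a sketch rather than a confirmed fix: form fields are widget annotations, and pdf.js reports those through `page.getAnnotations()` rather than as part of the page's regular text, which might explain the missing checks. The helper below merges text items and widget values by page position. Everything in it is hypothetical: the names `TextLike`, `WidgetLike`, `widgetToPlaced` and `mergeTextAndWidgets` are made up for illustration, and the widget fields (`subtype`, `checkBox`, `fieldType`, `fieldValue`, `rect`) and the `'Off'` value for an unchecked box are assumptions to verify against the installed pdfjs-dist version.

```typescript
// Minimal local shapes so the sketch runs without pdfjs-dist.
// Assumption: they mirror fields found on pdf.js text items and widget annotations.
interface TextLike {
  str: string;
  transform: number[]; // transform[4] = x, transform[5] = y
}

interface WidgetLike {
  subtype: string;      // 'Widget' for form fields (assumption)
  fieldType?: string;   // 'Btn' for checkboxes, 'Tx' for text fields (assumption)
  fieldValue?: unknown; // current value; 'Off' for an unchecked box (assumption)
  checkBox?: boolean;
  rect: number[];       // [x1, y1, x2, y2] in page coordinates
}

interface Placed {
  x: number;
  y: number;
  text: string;
}

// Convert a widget into positioned text, or null when it has nothing to show.
const widgetToPlaced = (w: WidgetLike): Placed | null => {
  if (w.subtype !== 'Widget') return null;
  const x = w.rect[0] ?? 0;
  const y = w.rect[1] ?? 0;
  if (w.checkBox === true) {
    const checked = typeof w.fieldValue === 'string' && w.fieldValue !== 'Off';
    // Unchecked boxes are skipped: the empty box glyph is already in the page text.
    return checked ? { x, y, text: '✔ ' } : null;
  }
  if (w.fieldType === 'Tx' && typeof w.fieldValue === 'string' && w.fieldValue !== '') {
    return { x, y, text: ' ' + w.fieldValue };
  }
  return null;
};

// Merge text and widget values: top-to-bottom by y, left-to-right by x.
// Items whose y differs by at most `tolerance` from the first item of a line share that line.
const mergeTextAndWidgets = (
  items: TextLike[],
  widgets: WidgetLike[],
  tolerance = 5
): string => {
  const placed: Placed[] = items.map((t) => ({
    x: t.transform[4] ?? 0,
    y: t.transform[5] ?? 0,
    text: t.str,
  }));
  for (const w of widgets) {
    const p = widgetToPlaced(w);
    if (p !== null) placed.push(p);
  }
  // PDF page coordinates start at the bottom-left, so a larger y is higher on the page.
  placed.sort((a, b) => b.y - a.y);

  const lines: { y: number; parts: Placed[] }[] = [];
  for (const p of placed) {
    const last = lines[lines.length - 1];
    if (last !== undefined && Math.abs(last.y - p.y) <= tolerance) {
      last.parts.push(p);
    } else {
      lines.push({ y: p.y, parts: [p] });
    }
  }
  return lines
    .map((l) => l.parts.sort((a, b) => a.x - b.x).map((p) => p.text).join(''))
    .join('\r\n');
};
```

Inside `getPageText` this could be called as `mergeTextAndWidgets(tokenizedText.items as TextItem[], await page.getAnnotations())`, with a cast if the annotation type does not match. The tolerance of 5 points is a guess, since a widget's rectangle rarely sits exactly on the text baseline, and would likely need tuning per document.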
Note that I have downloaded and put a couple of fonts in the standard_fonts directory. Should I copy them all even if I see no warning message?