PDF to Text extractor in nodejs without OS dependencies

Question

Is there a way to extract text from PDFs in nodejs without any OS dependencies (like pdf2text, or xpdf on Windows)? I wasn't able to find any 'native' PDF packages for nodejs. They are always a wrapper or utility on top of an existing OS command. Thanks

score 10 · Answer 1 · edited Feb 06 '21 at 23:51


After some work, I finally got a reliable function for reading text from a PDF using https://github.com/mozilla/pdfjs-dist.

To get this to work, first install the package from the command line:

npm i pdfjs-dist

Then create a file with this code (I named the file "pdfExport.js" in this example):

const pdfjsLib = require("pdfjs-dist");

// Returns the text fragments found on the first page of the PDF at `path`.
async function GetTextFromPDF(path) {
    const doc = await pdfjsLib.getDocument(path).promise;
    // Pages are numbered from 1; only the first page is read here.
    const page1 = await doc.getPage(1);
    const content = await page1.getTextContent();
    // Each item is one text fragment; `str` holds its text.
    return content.items.map(item => item.str);
}

module.exports = { GetTextFromPDF };

Then it can simply be used in any other js file you have like so:

const pdfExport = require('./pdfExport');
pdfExport.GetTextFromPDF('./sample.pdf').then(data => console.log(data));
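Since the function above returns a flat array of fragments, a common follow-up is turning them back into lines. Here is a minimal sketch of one way to do that. It assumes each item also carries a `transform` array where `transform[4]` is the x position and `transform[5]` is the baseline y, with y growing upward. `itemsToLines` and its `tolerance` parameter are names made up for this example; to use it you would return `content.items` instead of only the strings.

```javascript
// Sketch: group pdf.js text items into lines by their baseline position.
// Assumption: item.transform[4] is x and item.transform[5] is y (PDF units,
// y grows upward). `itemsToLines` is a hypothetical helper, not a pdf.js API.
function itemsToLines(items, tolerance = 2) {
    const rows = [];
    for (const item of items) {
        const x = item.transform[4];
        const y = item.transform[5];
        // Reuse an existing row when the baseline is within the tolerance.
        let row = rows.find(r => Math.abs(r.y - y) <= tolerance);
        if (!row) {
            row = { y, items: [] };
            rows.push(row);
        }
        row.items.push({ x, str: item.str });
    }
    return rows
        .sort((a, b) => b.y - a.y)          // top of the page first
        .map(r => r.items
            .sort((a, b) => a.x - b.x)      // left to right within a line
            .map(i => i.str)
            .join(""));                     // fragments keep their own spacing
}
```

Fragments are joined without a separator, so this relies on the PDF's own whitespace fragments. Joining with a space may suit some documents better.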

edited Feb 06 '21 at 23:51 by CodeWizard
answered Apr 17 '20 at 18:46 by Jack Fairfield

Hello @Jack, yours is the ONLY code that works for me to extract text with pdfjs in an nwjs app. I had been trying for a week and finally got it working. Thanks a ton for your answer. – Sunil Kumar Aug 08 '20 at 09:38
Getting the following error: The browser/environment lacks native support for critical functionality used by the PDF.js library (e.g. `Path2D` and/or `ReadableStream`); please use a `legacy`-build instead. – Abhay Apr 26 '23 at 12:48
Worked flawlessly bro, thank you. – Vibhu Pandey Jul 24 '23 at 07:36

score 8 · Accepted Answer · answered Jun 15 '15 at 10:46

Have you checked PDF2Json? It is built on top of PDF.js. It does not provide the text output as a single line, but I believe you can reconstruct the final text from the generated JSON output:

'Texts': an array of text blocks with position, actual text and styling information:
    'x' and 'y': relative coordinates for positioning
    'clr': a color index in color dictionary, same 'clr' field as in 'Fill' object. If a color can be found in color dictionary, 'oc' field will be added to the field as 'original color' value.
    'A': text alignment, including: left, center, right
    'R': an array of text runs, each text run object has two main fields:
        'T': actual text
        'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
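To illustrate the reconstruction, here is a rough sketch that rebuilds plain text from one page object carrying the 'Texts' array described above. It rests on a few assumptions that should be checked against your PDF2Json version: that 'T' values are URI-encoded, that 'y' grows downward, and that blocks within 0.1 units of each other sit on the same line. `pageToText`, `decodeRun` and `yTolerance` are names invented for this example.

```javascript
// Sketch: rebuild plain text from one page of PDF2Json output.
// Assumption: 'T' values are URI-encoded; if decoding fails, the raw value is kept.
function decodeRun(run) {
    try {
        return decodeURIComponent(run.T);
    } catch (err) {
        return run.T;
    }
}

// `page` is assumed to look like { Texts: [{ x, y, R: [{ T, S }] }, ...] }.
function pageToText(page, yTolerance = 0.1) {
    const blocks = page.Texts
        .map(t => ({ x: t.x, y: t.y, text: t.R.map(decodeRun).join("") }))
        // Assumption: y grows downward, so ascending y reads top to bottom.
        .sort((a, b) => a.y - b.y || a.x - b.x);

    const lines = [];
    let current = null;
    for (const block of blocks) {
        if (current && Math.abs(block.y - current.y) <= yTolerance) {
            current.blocks.push(block);     // same visual line
        } else {
            current = { y: block.y, blocks: [block] };
            lines.push(current);
        }
    }
    return lines
        .map(line => line.blocks
            .sort((a, b) => a.x - b.x)      // left to right within a line
            .map(b => b.text)
            .join(" "))
        .join("\n");
}
```

Blocks on one line are joined with a single space, which is a simplification. Real documents may need the 'x' gaps to decide where spaces belong.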

Thanks, this seems to do the job. -Bart – bartium, Jun 16 '15 at 22:42

score 6 · Answer 3 · answered Jun 09 '20 at 10:18

Thought I'd chime in here for anyone who comes across this question in the future. I had this problem and spent hours going over literally all the PDF libraries on NPM. I needed to run it on AWS Lambda, so I could not rely on OS dependencies.

The code below is adapted from another stackoverflow answer (which I cannot currently find). The only difference is that we import the ES5 version, which works with Node >= 12. If you just import pdfjs-dist there will be an error of "ReadableStream is not defined". Hope it helps!

import * as pdfjslib from 'pdfjs-dist/es5/build/pdf.js';

export default class Pdf {
  public static async getPageText(pdf: any, pageNo: number) {
    const page = await pdf.getPage(pageNo);
    const tokenizedText = await page.getTextContent();
    const pageText = tokenizedText.items.map((token: any) => token.str).join('');
    return pageText;
  }

  public static async getPDFText(source: any): Promise<string> {
    const pdf = await pdfjslib.getDocument(source).promise;
    const maxPages = pdf.numPages;
    const pageTextPromises = [];
    for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
      pageTextPromises.push(Pdf.getPageText(pdf, pageNo));
    }
    const pageTexts = await Promise.all(pageTextPromises);
    return pageTexts.join(' ');
  }
}

Usage

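import { promises as fs } from 'fs';

// fs.promises.readFile returns a Promise, so it has to be awaited.
const fileBuffer = await fs.readFile('sample.pdf');
const pdfText = await Pdf.getPDFText(fileBuffer);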
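The class above requests every page at once through Promise.all. In a memory-constrained environment such as Lambda, it may be worth reading pages one at a time instead. The following is only a sketch of that variant in plain JavaScript. `getPagesSequentially` is a made-up name, and `pdf` is assumed to behave like the document returned by `getDocument(...).promise`, meaning it exposes `numPages` and `getPage(n)`.

```javascript
// Sketch: extract text one page at a time instead of all pages in parallel.
// Assumption: `pdf` exposes numPages and getPage(n), and each page exposes
// getTextContent(), as the pdf.js document used in the class above does.
async function getPagesSequentially(pdf) {
    const pages = [];
    for (let pageNo = 1; pageNo <= pdf.numPages; pageNo += 1) {
        // Each page is awaited before the next one is requested.
        const page = await pdf.getPage(pageNo);
        const content = await page.getTextContent();
        pages.push(content.items.map(item => item.str).join(""));
    }
    return pages; // one string per page, in page order
}
```

Returning an array rather than one joined string leaves the choice of page separator to the caller, for example `pages.join('\n\n')`.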

I'm also trying to get pdfjs running inside a lambda function. When I import the library, I get the following error: Setting up fake worker failed: "Cannot find module './pdf.worker.js'". Have you encountered this and perhaps found a solution? – florian norbert bepunkt, Jul 30 '21 at 20:32
Any chance you have updated this code for your needs in the past couple of years? I have been using this solution and it's OK, but I would love it if there was a solution as good as whatever is happening here: https://pdftotext.com/ – chrismead, Dec 08 '22 at 21:29
As far as I could tell from checking pdftotext.com, they don't process PDFs in the cloud; it looks like a browser solution using pdf.js. – Kalana Perera, Jan 08 '23 at 15:44

score 2 · Answer 4 · answered Oct 31 '22 at 16:20

This solution worked for me on Node 14.20.1 with "pdf-parse": "^1.1.1".

You can install it with:

yarn add pdf-parse

This is the main function which converts the PDF file to text.

const path = require('path');
const fs = require('fs');
const pdf = require('pdf-parse');
const assert = require('assert');

const extractText = async (pathStr) => {
  assert (fs.existsSync(pathStr), `Path does not exist ${pathStr}`)
  const pdfFile = path.resolve(pathStr)
  const dataBuffer = fs.readFileSync(pdfFile);
  const data = await pdf(dataBuffer)
  return data.text
}

module.exports = {
  extractText
}

Then you can use the function like this:

const { extractText } = require('../api/lighthouse/lib/pdfExtraction')

extractText('./data/CoreDeveloper-v5.1.4.pdf').then(t => console.log(t))
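`data.text` comes back as one string for the whole document. If you need the text per page, pdf-parse accepts a `pagerender` option that is called with each page. Below is a small sketch built on that option. `makePageCollector` is a hypothetical helper name, and it assumes pages are rendered in order, which is worth verifying for your version.

```javascript
// Sketch: collect text per page through pdf-parse's `pagerender` option.
// makePageCollector is a hypothetical helper, not part of pdf-parse.
function makePageCollector() {
    const pages = [];
    async function pagerender(pageData) {
        const content = await pageData.getTextContent();
        const text = content.items.map(item => item.str).join(" ");
        pages.push(text);   // assumption: pages arrive in document order
        return text;        // pdf-parse still builds data.text from this value
    }
    return { pages, pagerender };
}

// Intended use (needs pdf-parse installed, so it is left as a comment):
//   const { pages, pagerender } = makePageCollector();
//   const data = await pdf(dataBuffer, { pagerender });
//   console.log(pages.length, pages[0]);
```

Joining fragments with a space is a simplification and can split words that the PDF stores in several pieces.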

score 0 · Answer 5 · edited Mar 04 '17 at 18:51

Instead of using the proposed PDF2Json, you can also use PDF.js directly (https://github.com/mozilla/pdfjs-dist). The advantage is that you do not depend on modesty, the owner of PDF2Json, to keep its PDF.js base up to date.

answered Jun 06 '16 at 12:50 by velop
