How to extract content of PDF in React.js?

Question

I am trying to load a PDF file from my local storage and extract its content in React.js, without any backend.

I searched Google for a suitable module but haven't found one yet. There are many Node modules for parsing PDFs, and I can extract the content of a PDF on the backend, but I am not sure whether those modules can be used in web browsers.

https://stackoverflow.com/questions/49763744/how-to-extract-text-from-a-pdf-file-in-javascript — JASBIR SINGH, May 16 '23 at 13:02
Thanks JASBIR, but I need to load a PDF from my local drive in React.js. In this case I can get the PDF using an input element, but I don't know what to do next. — TopTen1310, May 16 '23 at 13:08

Prayas Jain · Accepted Answer · 2023-05-16T15:00:33.697

To extract the content of a PDF in a React.js app, you can use the pdfjs-dist library, which provides functionality for working with PDF files. Here's an example of how you can achieve this:

Install the required packages: Start by installing the pdfjs-dist package, plus react-pdf for the Document and Page components used in the example, using npm or yarn:
```
npm install pdfjs-dist react-pdf
```

Import the required modules in your component:

```
// react-pdf re-exports the pdfjs-dist instance it uses, so the worker
// configured on `pdfjs` matches the one used by Document and Page.
import { Document, Page, pdfjs } from 'react-pdf';
```

Configure the PDF.js library: Before loading the PDF file, you need to configure the pdfjs library by setting the correct path to the worker file. You can do this in the component where you'll be working with PDF files:
```
pdfjs.GlobalWorkerOptions.workerSrc = `//cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjs.version}/pdf.worker.js`;
```

Load and extract content from the PDF: In your component, you can load the PDF file and extract its content. Here's an example using a function component and hooks:

```
import React, { useState } from 'react';

const PdfExtractor = () => {
  const [numPages, setNumPages] = useState(0);
  const [pdfText, setPdfText] = useState('');

  // react-pdf passes the loaded PDF document to onLoadSuccess,
  // so the file does not have to be fetched again for every page.
  const onDocumentLoadSuccess = (pdf) => {
    setNumPages(pdf.numPages);

    // Extract text from each page (page numbers start at 1)
    const textPromises = [];
    for (let i = 1; i <= pdf.numPages; i++) {
      textPromises.push(
        pdf.getPage(i)
          .then((page) => page.getTextContent())
          .then((textContent) =>
            textContent.items.map((item) => item.str).join(' ')
          )
      );
    }

    Promise.all(textPromises)
      .then((pageTexts) => setPdfText(pageTexts.join(' ')))
      .catch((error) => console.error('Failed to extract PDF text:', error));
  };

  return (
    <div>
      <Document
        file="path/to/pdf/file.pdf"
        onLoadSuccess={onDocumentLoadSuccess}
      >
        {/* Render one Page per PDF page once the page count is known */}
        {Array.from({ length: numPages }, (_, index) => (
          <Page key={`page_${index + 1}`} pageNumber={index + 1} />
        ))}
      </Document>
      <div>{pdfText}</div>
    </div>
  );
};

export default PdfExtractor;
```

In the above example, replace 'path/to/pdf/file.pdf' with the actual path or URL of your PDF file.

The onDocumentLoadSuccess function is called when the PDF is successfully loaded. It extracts the text content from each page of the PDF and joins them together.
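
Joining every item with a single space discards line breaks. As a rough sketch, the helper below builds the page text from the `items` array and starts a new line wherever an item's `hasEOL` flag is set. The name `itemsToText` is hypothetical, and the sketch assumes text items expose `str` and `hasEOL` as in recent pdfjs-dist releases; entries without a string `str` are skipped.

```javascript
// Hypothetical helper (not part of pdfjs-dist): turns the `items` array
// returned by page.getTextContent() into plain text.
// Assumption: each text item has `str` and, in recent releases, `hasEOL`.
function itemsToText(items) {
  let text = '';
  for (const item of items) {
    // Skip entries that carry no text (for example marked-content markers).
    if (typeof item.str !== 'string') continue;
    text += item.str;
    // End the line where the item is flagged as end-of-line, otherwise
    // separate neighbouring items with a space.
    text += item.hasEOL ? '\n' : ' ';
  }
  return text.trim();
}
```

Inside the extraction loop this could stand in for the `map(...).join(' ')` expression, for example `itemsToText(textContent.items)`.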

The extracted text is stored in the pdfText state variable, which can be rendered within the component or used as needed.

The Document component from react-pdf is used to render the PDF pages, and the Page component represents each individual page.

By following these steps, you can extract the content of a PDF in a React.js app using the pdfjs-dist library.

UPDATE:

To allow file selection using the <input> component, you can do as follows:

```
import { useState } from 'react';
import * as pdfjs from 'pdfjs-dist';

// Assumes pdfjs.GlobalWorkerOptions.workerSrc has already been set
// for this import, as in the configuration step.
function YourComponent() {
  const [pdfContent, setPdfContent] = useState('');

  const handleFileChange = async (event) => {
    const file = event.target.files[0];
    if (!file) return; // the file dialog was cancelled

    try {
      // Read the selected file in the browser; nothing is uploaded.
      const data = new Uint8Array(await file.arrayBuffer());
      // getDocument returns a loading task; await its `promise`.
      const pdf = await pdfjs.getDocument({ data }).promise;
      const pageTexts = [];

      // Page numbers start at 1.
      for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const textContent = await page.getTextContent();
        pageTexts.push(textContent.items.map((item) => item.str).join(' '));
      }

      setPdfContent(pageTexts.join(' '));
    } catch (error) {
      console.error('Failed to extract PDF text:', error);
    }
  };

  return (
    <div>
      <input type="file" accept="application/pdf" onChange={handleFileChange} />
      <div>{pdfContent}</div>
    </div>
  );
}

export default YourComponent;
```
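
Because the file comes from an `<input>`, a user can pick something that is not a PDF. One possible guard, sketched below, checks that the bytes read from the file begin with the `%PDF-` header before handing them to the parser. The helper name `looksLikePdf` is hypothetical, and this is only a quick sanity check, not full validation.

```javascript
// Hypothetical helper: returns true when the ArrayBuffer starts with "%PDF-".
function looksLikePdf(buffer) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  // View only as many leading bytes as the signature needs.
  const length = Math.min(buffer.byteLength, signature.length);
  const bytes = new Uint8Array(buffer, 0, length);
  return (
    bytes.length === signature.length &&
    signature.every((byte, i) => bytes[i] === byte)
  );
}
```

In the change handler, the check could run on the ArrayBuffer read from the file, returning early (or showing a message) when it yields `false`.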

Hi Prayas, thanks for your answer. You mentioned 'path/to/pdf/file.pdf'; does that mean the PDF file has to be located in the public folder? I want to load the PDF using an input element. — TopTen1310, May 16 '23 at 14:11
@PrayasJain I'm trying out your code, as of version 3.7.107 of `pdf-dist`, there doesn't seem to be a `PDFDocument` that's exported — PGT, Jun 14 '23 at 03:53
This answer looks like it was generated by an AI (like ChatGPT), not by an actual human being. You should be aware that [posting AI-generated output is officially **BANNED** on Stack Overflow](https://meta.stackoverflow.com/q/421831). If this answer was indeed generated by an AI, then I strongly suggest you delete it before you get yourself into even bigger trouble: **WE TAKE PLAGIARISM SERIOUSLY HERE.** Please read: [Why posting GPT and ChatGPT generated answers is not currently allowed](https://stackoverflow.com/help/gpt-policy). — tchrist, Jul 11 '23 at 13:34
