To extract the content of a PDF in a React.js app, you can use the pdfjs-dist library (the npm distribution of Mozilla's PDF.js), which provides functionality for working with PDF files. The example below also uses react-pdf to render the pages. Here's an example of how you can achieve this:
Install the required packages:
Start by installing the pdfjs-dist and react-pdf packages using npm or yarn:
npm install pdfjs-dist react-pdf
Import the required modules in your component:
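// react-pdf re-exports the pdfjs-dist instance it uses internally;
// importing pdfjs from there keeps the API and worker versions in sync.
// (The 'react-pdf/dist/esm/entry.webpack' path exists only in older react-pdf releases.)
import { Document, Page, pdfjs } from 'react-pdf';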
Configure the PDF.js library:
Before loading the PDF file, you need to configure the pdfjs library by setting the path to its worker file. Do this once at module scope, outside the component, so it is not re-run on every render. The worker file name has changed between PDF.js releases, so check that this URL resolves for the version you have installed:
pdfjs.GlobalWorkerOptions.workerSrc = `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${pdfjs.version}/pdf.worker.js`;
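A hard-coded `pdf.worker.js` file name may not exist for every release. As a rough sketch, the hypothetical helper `workerUrlFor` below picks the file name from the major version. It assumes that cdnjs serves `pdf.worker.min.mjs` for 4.x and later and `pdf.worker.min.js` for earlier releases; verify this for the version you actually install.

```javascript
// Hypothetical helper: builds a cdnjs worker URL for a given pdfjs-dist version.
// Assumption: 4.x and later ship the worker as an ES module (.mjs),
// earlier releases as a classic script (.js).
function workerUrlFor(version) {
  const major = Number.parseInt(String(version).split('.')[0], 10);
  if (Number.isNaN(major)) {
    throw new Error(`Unrecognized pdfjs-dist version: ${version}`);
  }
  const fileName = major >= 4 ? 'pdf.worker.min.mjs' : 'pdf.worker.min.js';
  return `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/${version}/${fileName}`;
}
```

It could then be used as `pdfjs.GlobalWorkerOptions.workerSrc = workerUrlFor(pdfjs.version);`.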
Load and extract content from the PDF:
In your component, you can load the PDF file and extract its content. Here's an example using a function component and hooks:
import React, { useState } from 'react';

const PdfExtractor = () => {
  const [numPages, setNumPages] = useState(null);
  const [pdfText, setPdfText] = useState('');

  // react-pdf passes the loaded PDFDocumentProxy to onLoadSuccess,
  // so the file does not need to be fetched again with getDocument().
  const onDocumentLoadSuccess = (pdf) => {
    setNumPages(pdf.numPages);

    // Extract text from each page (page numbers start at 1)
    const textPromises = [];
    for (let i = 1; i <= pdf.numPages; i++) {
      textPromises.push(
        pdf
          .getPage(i)
          .then((page) => page.getTextContent())
          .then((textContent) =>
            textContent.items.map((item) => item.str).join(' ')
          )
      );
    }

    Promise.all(textPromises)
      .then((pageTexts) => setPdfText(pageTexts.join(' ')))
      .catch((error) => console.error('Failed to extract PDF text:', error));
  };

  return (
    <div>
      <Document
        file="path/to/pdf/file.pdf"
        onLoadSuccess={onDocumentLoadSuccess}
      >
        {/* numPages is null until the document loads; new Array(null) would have length 1 */}
        {Array.from(new Array(numPages || 0), (el, index) => (
          <Page key={`page_${index + 1}`} pageNumber={index + 1} />
        ))}
      </Document>
      <div>{pdfText}</div>
    </div>
  );
};

export default PdfExtractor;
In the above example, replace 'path/to/pdf/file.pdf' with the actual path or URL of your PDF file.
The onDocumentLoadSuccess function is called when the PDF is successfully loaded. It extracts the text content from each page of the PDF and joins the results together.
The extracted text is stored in the pdfText state variable, which can be rendered within the component or used as needed.
The Document component from react-pdf is used to render the PDF pages, and the Page component represents each individual page.
By following these steps, you can extract the content of a PDF in a React.js app using the pdfjs-dist library.
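Joining every item with a space, as in the example, flattens the page into a single line. If you want to keep line breaks, one possible approach is sketched below. It assumes each entry in `textContent.items` has a string `str` and, in recent pdfjs-dist versions, a boolean `hasEOL` marking the end of a line; `pageTextFromItems` is a hypothetical helper name, not part of the library.

```javascript
// Sketch: turn the `items` array from page.getTextContent() into text,
// starting a new line wherever an item is flagged with `hasEOL`.
// Assumption: items without a string `str` (e.g. marked-content entries) can be skipped.
function pageTextFromItems(items) {
  const lines = [];
  let current = [];
  for (const item of items) {
    if (typeof item.str !== 'string') continue; // not a text item
    if (item.str !== '') current.push(item.str);
    if (item.hasEOL) {
      lines.push(current.join(' '));
      current = [];
    }
  }
  // Flush whatever remains after the last end-of-line marker
  if (current.length > 0) lines.push(current.join(' '));
  return lines.join('\n');
}
```

Inside the extraction code, `textContent.items.map((item) => item.str).join(' ')` could then be swapped for `pageTextFromItems(textContent.items)`. When `hasEOL` is absent, the result is the same space-joined string as before.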
UPDATE:
To let the user select a file with an <input> element, you can do as follows:
import { useState } from 'react';
// pdfjs-dist does not export PDFDocument (that class belongs to pdf-lib);
// documents are opened with getDocument() instead.
import * as pdfjsLib from 'pdfjs-dist';

// Assumes pdfjsLib.GlobalWorkerOptions.workerSrc has been set as in the configuration step above.

function YourComponent() {
  const [pdfContent, setPdfContent] = useState('');

  const handleFileChange = async (event) => {
    const file = event.target.files[0];
    if (!file) return; // the file dialog was cancelled

    try {
      // Read the selected file into memory and hand the bytes to PDF.js
      const contents = await file.arrayBuffer();
      const pdf = await pdfjsLib.getDocument({ data: contents }).promise;

      const pageTexts = [];
      for (let i = 1; i <= pdf.numPages; i++) {
        const page = await pdf.getPage(i);
        const textContent = await page.getTextContent();
        pageTexts.push(textContent.items.map((item) => item.str).join(' '));
      }
      setPdfContent(pageTexts.join(' '));
    } catch (error) {
      console.error('Failed to extract PDF text:', error);
    }
  };

  return (
    <div>
      <input type="file" accept="application/pdf" onChange={handleFileChange} />
      <div>{pdfContent}</div>
    </div>
  );
}

export default YourComponent;
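Because a file input lets the user pick any file, it can help to reject non-PDF data before passing it to PDF.js. A minimal sketch follows, assuming the `%PDF-` signature sits at the very start of the file (some files in the wild have leading bytes and would be rejected by this check). `looksLikePdf` is a hypothetical helper, not a pdfjs-dist API.

```javascript
// Sketch: check whether a buffer starts with the PDF signature "%PDF-".
function looksLikePdf(arrayBuffer) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  const length = Math.min(signature.length, arrayBuffer.byteLength);
  const bytes = new Uint8Array(arrayBuffer, 0, length);
  return (
    bytes.length === signature.length &&
    signature.every((byte, i) => bytes[i] === byte)
  );
}
```

In a change handler like the one above, this could be called on the file's ArrayBuffer before it is handed to getDocument(), showing a message to the user when it returns false.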