Using pdf.js, I have written a simple function to extract the raw text from a PDF:
async function getPdfText(path) {
  // Newer pdf.js releases expose this promise as getDocument(path).promise.
  const pdf = await PDFJS.getDocument(path);
  const pagePromises = [];
  // Page numbers are 1-based.
  for (let j = 1; j <= pdf.numPages; j++) {
    const page = pdf.getPage(j);
    pagePromises.push(page.then((page) => {
      const textContent = page.getTextContent();
      return textContent.then((text) => {
        // Every text item is concatenated with no separator.
        return text.items.map((s) => s.str).join('');
      });
    }));
  }
  const texts = await Promise.all(pagePromises);
  return texts.join('');
}
// usage
getPdfText("C:\\my.pdf").then((text) => { console.log(text); });
However, I can't find a way to extract the line breaks correctly: all the text comes out on a single line.
How can I extract the text correctly? I want to get the text the same way as on a desktop PC:
Open the PDF (double-click the file) -> select all text (CTRL + A) -> copy the selected text (CTRL + C) -> paste the copied text (CTRL + V)
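One direction that might work, sketched here under assumptions rather than as a confirmed solution: each item returned by getTextContent() carries a `transform` array alongside `str`, and its last entry (`transform[5]`) is the item's vertical position on the page. Starting a new line whenever that value changes would restore at least the basic line breaks. The helper name `itemsToLines`, the `tolerance` parameter, and the sample items below are hypothetical and only for illustration.

```javascript
// Hypothetical helper (not part of pdf.js): rebuilds line breaks from the
// text items returned by page.getTextContent().
// Assumption: each item has `str` and a 6-element `transform` array whose
// last entry (transform[5]) is the item's vertical position on the page.
function itemsToLines(items, tolerance = 1) {
  let text = '';
  let lastY = null;
  for (const item of items) {
    const y = item.transform[5];
    // A vertical jump larger than the tolerance is treated as a new line.
    if (lastY !== null && Math.abs(y - lastY) > tolerance) {
      text += '\n';
    }
    text += item.str;
    lastY = y;
  }
  return text;
}

// Stand-in items for illustration; real ones would come from getTextContent().
const sampleItems = [
  { str: 'Hello ', transform: [1, 0, 0, 1, 50, 700] },
  { str: 'world', transform: [1, 0, 0, 1, 90, 700] },
  { str: 'Second line', transform: [1, 0, 0, 1, 50, 686] },
];
console.log(itemsToLines(sampleItems));
```

In getPdfText, this would take the place of the `text.items.map((s) => s.str).join('')` call, with the pages then joined by '\n' instead of ''. It would probably still differ from what a viewer copies for multi-column layouts or rotated text, since it only looks at the vertical position.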