extract text from pdf file using only javascript

Question

How can I extract data from a PDF file using only JavaScript, on the client side, in any browser?

Does this answer your question? https://stackoverflow.com/questions/1554280/extract-text-from-pdf-in-javascript — John Goofy, Jan 10 '21 at 13:09

Christophe · Answer 1 · 2012-12-13T23:48:44.640

2

pdf.js is a JavaScript pdf reader: http://mozilla.github.com/pdf.js/

Some similar projects:

for docx and xlsx: http://blog.innovatejs.com/?p=184
jsPDF is a pdf generator: https://github.com/MrRio/jsPDF

If you are asking how to load the file, this can be done via an ajax request, but you won't be able to directly read the file content.
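
For what it's worth, here is a minimal sketch of pulling text out with pdf.js. It assumes a more recent pdf.js build than the one linked above, exposed as a `pdfjsLib` global, and uses its `getDocument`, `getPage` and `getTextContent` calls. The function names `itemsToText` and `extractPdfText` are made up for illustration, and the output quality depends entirely on how the PDF was produced.

```javascript
// Sketch only: `pdfjsLib` is assumed to be the global from a recent pdf.js build.

// Join the text items of one page into a single string.
// Items are assumed to look like { str: "..." }, as returned by getTextContent().
function itemsToText(items) {
  return items
    .filter((item) => typeof item.str === "string") // skip non-text entries
    .map((item) => item.str)
    .join(" ");
}

// Resolve to an array with one string per page.
// `source` is whatever getDocument accepts, e.g. a URL string or a typed array.
async function extractPdfText(source, pdfjs = globalThis.pdfjsLib) {
  const pdf = await pdfjs.getDocument(source).promise;
  const pages = [];
  for (let n = 1; n <= pdf.numPages; n++) { // pages are numbered from 1
    const page = await pdf.getPage(n);
    const content = await page.getTextContent();
    pages.push(itemsToText(content.items));
  }
  return pages;
}
```

Usage would be something like `extractPdfText("file.pdf").then(console.log)`, where `file.pdf` is a placeholder. The strings come back in the order pdf.js reports them, which is not guaranteed to be reading order.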

edited Dec 13 '12 at 23:48

answered Dec 13 '12 at 23:43

Christophe


As per millimoose's comments, the output is still atrocious with pdf.js, but it seems like it could be used with some hacks to extract text information "in a sufficient manner", fsvo sufficient. – Dec 14 '12 at 01:19
@pst right, still it's a nice effort and there's not much choice. – Christophe Dec 14 '12 at 01:28

millimoose · Answer 2 · 2012-12-13T23:51:11.510

-1

What you're asking is practically impossible.

PDF is a heavyweight format optimised towards efficient display of large complex documents, not towards further processing. (In fact, PDF documents primarily consist of letter shapes and other graphics absolutely positioned on pages. Any data representing "paragraphs of text" is an optional feature of tagged PDFs.)

Text extraction tends to be a feature of (usually expensive) PDF libraries, and to the best of my knowledge no such library exists for Javascript. Scribd and Google Docs do this, but they probably don't share how, and my guess is they do this on the server side.

tl;dr: PDF, as a format, is terrible for this. Unless basically the entire point of your application is extracting text from PDFs, your time would be better spent on figuring out how to not have to do it.

edited Dec 13 '12 at 23:51

answered Dec 13 '12 at 23:32

millimoose


What about https://github.com/mozilla/pdf.js/? Not sure if it has an easy API... – elclanrs Dec 13 '12 at 23:39
@elclanrs From the description that's a PDF *rendering* library. Displaying PDF is a separate problem from extracting data suitable for text processing. An analogy would be drawing a JPEG versus recognizing faces in one. Low-level code to parse the raw data is the same, but interpreting this data is completely different. – millimoose Dec 13 '12 at 23:44
@elclanrs There's a nonzero probability it has or might within a reasonable timeframe have the capability to extract whatever such data is in *tagged PDFs*, but seeing as – like every GitHub library it seems – there's zero accessible reference documentation, it's a chore to tell whether that's the case. – millimoose Dec 13 '12 at 23:46
I feel you. There's this though https://github.com/mozilla/pdf.js/blob/master/src/api.js. Seems good enough for now, I mean it's just an alpha product anyway. – elclanrs Dec 13 '12 at 23:49
@millimoose once the pdf is rendered as html, I assume it should not be too difficult to extract the data you need. – Christophe Dec 13 '12 at 23:56
It is hard enough getting the text out of a pdf using a full powered programming language, so pdf.js is probably more of a dream than a reality. It also looks like vaporware with no documentation, which might be a liability to a large project. – le3th4x0rbot Dec 13 '12 at 23:56
@elclanrs Well, the source tree makes only one offhand mention of tagged PDFs, so that seems to be a bust. The tags might be accessible from the low-level data structures the library parses the documents into, but how to work with those requires a knowledge of the PDF format that's way more intimate than mine. (Or that of many people really. There's a reason why most of the good PDF libraries are proprietary and expensive.) – millimoose Dec 13 '12 at 23:56
@Christophe That's wishful thinking. PDF rendered into HTML can very easily be "single letters absolutely positioned in a `<div>`". Perfectly readable to a human, not so to code. It's a long way from "meaningful paragraphs". Libraries that extract text from arbitrary PDFs basically look for aligned rows of letter shapes, and use spacing between the letters and blocks of such rows to determine what "words" and "paragraphs" are. – millimoose Dec 13 '12 at 23:59
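
To make the row-and-spacing idea concrete, here is a toy sketch, not how any particular library does it. It assumes glyphs arrive as hypothetical `{ char, x, y, width }` objects with `y` growing down the page, and the `lineTolerance` and `wordGap` thresholds are arbitrary values picked for illustration. It only recovers lines and words; paragraph detection would need a similar pass over the vertical gaps between rows.

```javascript
// Group absolutely positioned glyphs into lines of text.
// Glyph shape { char, x, y, width } is an assumption made for this sketch.
function glyphsToLines(glyphs, { lineTolerance = 2, wordGap = 1.5 } = {}) {
  const rows = [];
  // Bucket glyphs into rows whose vertical positions are within lineTolerance.
  for (const glyph of [...glyphs].sort((a, b) => a.y - b.y)) {
    const row = rows.find((r) => Math.abs(r.y - glyph.y) <= lineTolerance);
    if (row) row.glyphs.push(glyph);
    else rows.push({ y: glyph.y, glyphs: [glyph] });
  }
  // Within each row, read left to right and insert a space at large gaps.
  return rows.map((row) => {
    row.glyphs.sort((a, b) => a.x - b.x);
    let text = "";
    let prevEnd = null; // right edge of the previous glyph
    for (const g of row.glyphs) {
      if (prevEnd !== null && g.x - prevEnd > wordGap) text += " ";
      text += g.char;
      prevEnd = g.x + g.width;
    }
    return text;
  });
}
```

Real documents break this quickly (multiple columns, rotated text, kerning wider than a word gap), which is why the thresholds usually end up being tunable.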
@Christophe You can get a glimpse at what can / has to be done by looking at the configuration options for Calibre's converter: http://manual.calibre-ebook.com/cli/ebook-convert.html (In fact, were I to do a feature like this, my first approach would be feeding the PDF to calibre and crossing my fingers. Even then it's unlikely the results will be always satisfactory without the end user tweaking those parameters.) – millimoose Dec 14 '12 at 00:04
For text there is [this](http://hublog.hubmed.org/archives/001948.html) – Mike H-R Mar 06 '14 at 11:51
