Extracting text from a PDF file

Question

I need to extract the text from a PDF file. This text will likely be in a table format, and it is going to be used for automatic transfer of data between an external party and our systems.

Can anyone suggest a command-line tool (e.g. pdf to txt) or a library that would be good for this?

Language options:

C# (preferred)
Java (if I must)

I found some ideas here, but I think the guy was talking more about a one-off situation; I'm talking about something more like a daily import:

https://stackoverflow.com/questions/488089/extracting-tables-from-pdf-files

Do you want to retain the table format of the text? If so, the task becomes considerably more difficult; if not, any of the suggested PDF-to-text libraries should do. – Rowan Aug 14 '09 at 04:48
Table format isn't important, it just needs to be machine-readable so I can parse it and shove it into a database. – Chris Aug 14 '09 at 06:40

score 4 · Accepted Answer · answered Aug 14 '09 at 04:27

Try this:

http://www.codeproject.com/KB/cs/PDFToText.aspx

Bye

RRUZ

That uses iTextSharp, for later reference. – Chris Aug 14 '09 at 05:03

Anton Geraschenko · Answer 2 · 2009-08-14T05:12:23.183

pdftotext seems to do the trick quite nicely.

pdftotext file.pdf [textfile.txt]

Edit: I'm not sure how you would like to retain information about the tables. The best-looking output (to my human eye, at least) is produced by

pdftotext -layout file.pdf [textfile.txt]

This preserves the original layout of the document as well as it can. In particular, tables still look pretty good in the text output. Without -layout, the default is to interpret the columns of the table as separate columns of text, which is terrible for tabular data. Another option that doesn't look as good to me, but might still be useful, is the -raw option.
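
Since the asker wants to parse the result and load it into a database, here is a rough Java sketch of one way to consume the -layout output. It assumes pdftotext is on the PATH, that "-" as the output file sends the text to stdout, and that columns are separated by at least two spaces while single spaces only occur inside a cell. The class and method names (LayoutTextParser, splitRow, parse) are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LayoutTextParser {

    // Split one line of `pdftotext -layout` output into cells.
    // Assumption: a run of two or more spaces marks a column gap.
    static List<String> splitRow(String line) {
        String trimmed = line.strip();
        if (trimmed.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(trimmed.split(" {2,}"));
    }

    // Parse the whole text into rows, skipping blank lines.
    // \R matches any line break, including the form feed emitted between pages.
    static List<List<String>> parse(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            List<String> cells = splitRow(line);
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java LayoutTextParser file.pdf");
            return;
        }
        // Run: pdftotext -layout -enc UTF-8 file.pdf -
        Process p = new ProcessBuilder("pdftotext", "-layout", "-enc", "UTF-8", args[0], "-")
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        String text = new String(p.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        if (p.waitFor() != 0) {
            System.err.println("pdftotext failed");
            return;
        }
        for (List<String> row : parse(text)) {
            System.out.println(String.join("|", row));
        }
    }
}
```

This breaks down when a cell is empty (the remaining cells shift left) or when two columns are only one space apart, so for a daily import it is worth checking the number of cells per row before inserting anything.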

edited Aug 14 '09 at 05:12

answered Aug 14 '09 at 04:40

According to Wikipedia, `xpdf` does have an implementation of `pdftotext`. The one I have came in the `poppler-utils` package. I can't seem to find a pdf with a table in it to test what the output looks like. What kind of output would you like? – Anton Geraschenko Aug 14 '09 at 04:54
Looks like poppler is a fork of xpdf, so it's probably the same tool. – Chris Aug 14 '09 at 06:39
I used the xpdf version of this and was very happy with the result. The -layout flag _really_ helped as Anton notes above. – Tim Perry Jul 07 '10 at 23:18

score 1 · Answer 3 · answered Aug 14 '09 at 04:52

I can't provide a solution, only general advice: open a PDF document in Notepad or another plain-text editor and study the formatting codes. They're very easy to understand. For example, //par is a paragraph and //tab is a tab. Once you know the formatting codes for table layouts, it'll be very easy for you to come up with your own solution to extract anything from a PDF document.

jay_t55

It's not that easy. There's a lot of work involved in extracting text from a document in a human-readable format. The task becomes a bit easier if you just need to extract text from the same document every time, but if you need to extract text from random documents from varying sources, it's not easy at all. So I wouldn't recommend this option unless you want to spend quite a bit of time perfecting it and really cannot use any third-party libraries. – Rowan Aug 14 '09 at 23:56

score 1 · Answer 4 · answered Aug 14 '09 at 06:12

There are also PDFBox and JPedal for Java. Tables do not exist in the PDF file format, so any software will be 'guessing' them.

mark stephens

score 1 · Answer 5 · answered Aug 14 '09 at 07:10

Apache Tika is an open-source Java toolkit that specializes in what you are looking for: extracting structured content from various documents, including PDF.

It uses PDFBox for the PDF file format but provides a level of abstraction that is ideal for extracting structured content.

It includes a command-line utility - see here.

Bobrovsky · Answer 6 · 2020-08-07T12:08:03.653

Tabular data in PDF files is usually hard to extract properly because most PDF files out there do not contain Structured Content metadata. Without this metadata, a PDF file is just a pile of text and other operations. Most of the time, only a human can say whether there is a table in a document.

Almost all sufficiently advanced tools and libraries try to structure text extracted from PDF in some way using heuristics. Results, of course, vary from tool to tool and from library to library.
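
To give a feel for what such a heuristic can look like, here is a toy Java sketch. It is not how any particular library works. It assumes the text has already been extracted with its layout preserved (fixed-width lines), and it guesses column boundaries by looking for character positions that are blank in every line. The names ColumnGuesser, guessColumns, cut and the minGap parameter are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnGuesser {

    // Guess column ranges [start, end) from fixed-width lines.
    // A position counts as "used" if any line has a non-blank character there.
    // A new column starts only after at least minGap unused positions in a row,
    // so a single space inside a cell does not split the column.
    static List<int[]> guessColumns(List<String> lines, int minGap) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        boolean[] used = new boolean[width];
        for (String line : lines) {
            for (int i = 0; i < line.length(); i++) {
                if (!Character.isWhitespace(line.charAt(i))) {
                    used[i] = true;
                }
            }
        }
        List<int[]> cols = new ArrayList<>();
        int start = -1;    // start of the current column, -1 if none is open
        int lastUsed = -1; // last used position seen in the current column
        for (int i = 0; i < width; i++) {
            if (!used[i]) {
                continue;
            }
            if (start >= 0 && i - lastUsed - 1 >= minGap) {
                cols.add(new int[] {start, lastUsed + 1});
                start = -1;
            }
            if (start < 0) {
                start = i;
            }
            lastUsed = i;
        }
        if (start >= 0) {
            cols.add(new int[] {start, lastUsed + 1});
        }
        return cols;
    }

    // Cut one line into cells using the guessed column ranges.
    static List<String> cut(String line, List<int[]> cols) {
        List<String> cells = new ArrayList<>();
        for (int[] c : cols) {
            int from = Math.min(c[0], line.length());
            int to = Math.min(c[1], line.length());
            cells.add(line.substring(from, to).strip());
        }
        return cells;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                String.format("%-8s  %3s  %5s", "Item", "Qty", "Price"),
                String.format("%-8s  %3s  %5s", "Widget A", "12", "3.50"),
                String.format("%-8s  %3s  %5s", "Bolt", "7", "0.25"));
        List<int[]> cols = guessColumns(lines, 2);
        for (String line : lines) {
            System.out.println(String.join("|", cut(line, cols)));
        }
    }
}
```

Unlike splitting each line on whitespace, this keeps empty cells in place, but a heading or a long paragraph that runs across the table will fill the gaps and hide the columns, which is exactly why results vary so much.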

You can try the Docotic.Pdf library (disclaimer: I work for Bit Miracle) to extract text from PDF files. I think the library should extract text with quality sufficient for further processing.

Please take a look at a sample that shows how to extract text from PDF.

score 0 · Answer 7 · answered Aug 14 '09 at 04:42

Try the open-source Java PDF library iText:

http://www.lowagie.com/iText/docs.html

janetsmith
