
Pretty simply, I need to rip text out of multiple PDFs (quite a lot actually) in order to analyse the contents before sticking it in an SQL database.

I've found some pretty sketchy free C# libraries that sort of work (the best one uses iTextSharp), but there are umpteen formatting errors: some characters are scrambled, and a lot of the time there are spaces (' ') everywhere: inside words, between every letter, in huge blocks taking up several lines. It all seems a bit random.

Is there any easy way of doing this that I'm completely overlooking (quite likely!) or is it a bit of an arduous task that involves converting the extracted byte values into letters reliably?

Bjarki Heiðar
Duncan Tait

6 Answers


Doing this reliably may be difficult. The problem is that PDF is a presentation format that attaches importance to good typography. Suppose you just wanted to output a single word: Tap.

A PDF rendering engine might output this as two separate calls, as shown in this pseudo-code:

moveto (x1, y); output ("T")
moveto (x2, y); output ("ap")

This would be done because the default kerning (inter-letter spacing) between the letters T and a might not be acceptable to the rendering engine, or because it is adding or removing micro-spacing between characters to produce a fully justified line. The upshot is that the text fragments found in a PDF are very often not whole words, but pieces of them.
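To make this concrete, here is a minimal sketch of the kind of heuristic an extractor has to apply to stitch fragments back into words: compare the horizontal gap between adjacent fragments against a threshold. Python is used here only for brevity (the question is about C#, but the idea is language-agnostic), and the coordinates, threshold, and function name are made up for illustration:

```python
def merge_fragments(fragments, gap_threshold=1.0):
    """fragments: list of (x_start, x_end, text) tuples on one baseline,
    sorted left to right. Gaps wider than gap_threshold become spaces;
    narrower gaps are treated as kerning artifacts and joined directly."""
    if not fragments:
        return ""
    pieces = [fragments[0][2]]
    prev_end = fragments[0][1]
    for x_start, x_end, text in fragments[1:]:
        if x_start - prev_end > gap_threshold:
            pieces.append(" ")   # wide gap: a real word break
        pieces.append(text)      # narrow gap: rejoin the split word
        prev_end = x_end
    return "".join(pieces)

# The "Tap" example above: "T" and "ap" drawn separately, almost touching.
print(merge_fragments([(10.0, 16.0, "T"), (16.3, 28.0, "ap")]))  # Tap
print(merge_fragments([(10.0, 16.0, "T"), (20.0, 32.0, "ap")]))  # T ap
```

Choosing the threshold is the hard part in practice; it typically has to scale with font size, which is why naive extractors produce the spurious spaces described in the question.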

Tarydon
  • Excellent description of the potential difficulties in extracting text from PDF. – Lunatik Dec 24 '10 at 10:12
  • To add to this, PDF ultimately places *glyphs*, not text. In many PDFs the glyph indexes are the same as the Unicode code point, but they may differ, in which case you get only gibberish unless text has been explicitly added as well. – Joey Sep 12 '18 at 09:31

Take a look at Tika on DotNet, available through Nuget: https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/

This is a wrapper around the extremely good Tika Java library, using IKVM. It is very easy to use and handles a wide variety of file types other than PDF, including old and new Office formats. It will auto-select the parser based on the file extension, so it's as easy as:

var text = new TextExtractor().Extract(file.FullName).Text;

Update: One caution with this solution is that development on IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx

David Hammond
  • This works. However, I'm having an issue with ligature characters 'ffi', 'ff', 'fi', etc, being presented as �. Have you had this issue? – kurt Mar 26 '18 at 11:25
  • @kurt, I have not noticed that problem, but I have mostly been feeding English text into a search index, where problems with ligature characters wouldn't be a major issue. – David Hammond Apr 02 '18 at 18:55

You can try Toxy, a text/data extraction framework for .NET. It supports .NET Standard 2.0. For details, please visit https://github.com/nissl-lab/toxy

Tony Qu
  • -1 This doesn't have anything to do with PDFs (yet). You might as well tell us to visit http://www.websitethatplansonhavingcodetoextracttextfrompdfsoneday.com – David Murdoch Feb 04 '14 at 20:07
  • I said it will. Anyway, you will see it soon. I'll make it available before June. – Tony Qu Feb 22 '14 at 23:26
  • Toxy 1.0 is here. It supports PDF now. – Tony Qu Jun 11 '14 at 02:57
  • Update your answer so I can remove my -1. :-) – David Murdoch Jun 11 '14 at 12:13
  • Licensing is a big question mark, as Toxy uses several external libraries, e.g. iTextSharp, which is AGPL unless you purchase a license. – Jussi Palo Jan 31 '15 at 09:43
  • @JussiPalo I didn't pay attention to iTextSharp license before. I will look for a replacement for iTextSharp. AGPL is not acceptable. LGPL, MIT or Apache is preferred. – Tony Qu Feb 23 '15 at 01:15
  • PDF Clown is the only one I've found after extensive research that might be feasible also in terms of licensing, but I couldn't get it to work with half day of struggling. Looked promising, though. – Jussi Palo Feb 23 '15 at 10:24
  • I've updated Toxy to use PDFSharp, which uses the MIT license. The new update is available in the Toxy 1.4 release. – Tony Qu Feb 24 '15 at 23:43
  • [Toxy project will not be developed anymore](https://github.com/tonyqus/toxy/issues/16) – dontbyteme Oct 11 '18 at 08:51
  • @DavidMurdoch Link is broken. – m_a_s Jan 02 '22 at 18:57
  • @m_a_s that was the point of the comment from almost 8 years ago. The OP posted a link to something that didn't exist yet. – David Murdoch Jan 03 '22 at 20:28
  • @DavidMurdoch I know. It was a poor joke. Just trying to be "that guy". Cheers! – m_a_s Jan 03 '22 at 23:07

If you are processing PDF files with the aim of importing the data into a database, then I suggest considering the ByteScout PDF Extractor SDK. Some useful functions include:

  • table detection;
  • text extraction as CSV, XML or formatted text (with optional layout restoration);
  • text search with support for regular expressions;
  • a low-level API to access text objects.
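As an aside, once the text is out of the PDF (whichever SDK does the extraction), regex-based searching of the kind listed above is plain standard-library work. A small illustration, shown in Python for brevity; this is not ByteScout's API, and the sample string and patterns are invented:

```python
import re

# Pretend this string came back from a PDF text extractor.
extracted = "Invoice 2041 dated 2010-12-24, total 99.50 EUR"

# Pull out fields with ordinary regular expressions.
invoice_numbers = re.findall(r"Invoice\s+(\d+)", extracted)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", extracted)

print(invoice_numbers)  # ['2041']
print(dates)            # ['2010-12-24']
```

The value of an SDK-level search API is mainly that it works against the page layout (coordinates, tables) rather than a flat string like this.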

DISCLAIMER: I'm affiliated with ByteScout

Eugene

You can try the Docotic.Pdf library (disclaimer: I work for Bit Miracle) to extract text from PDF files. The library uses heuristics to extract nice-looking text without unwanted spaces between the letters in words.

Please take a look at a sample that shows how to extract text from PDF.

Bobrovsky

If you're looking for a free alternative, check out PDF Clown. I have personally used an IFilter-based approach, which works fine in cases where you need to support other file types easily. Sample code here.

Jussi Palo