Parsing PDF files

Question

I need to split a large PDF document into smaller files based on the content of the file. We use BCL easyPDF to manipulate PDF files. easyPDF can split a PDF document at a given page number, but it cannot split the document based on its content. As far as I can tell, it also has no search function for determining where a piece of content is located (if I am wrong, please let me know).

Can someone tell me how I can find the location of text in a PDF file using .NET?

Thanks

Yes, but it should be (and is) a community where we can help people who may still be learning the ins and outs of a language or protocol. We can try to point them in the right direction. – Brian, May 03 '12 at 18:24
Isn't PDF a sort of binary file? You cannot just parse it as text; a library is required. – Alex, Jan 18 '17 at 16:51
I start out my year with my usual complaint. Why is this off topic? (I know the rules say it is.) It's very useful, and many of the preserved 'best' questions (which I see you can no longer find) are of this nature. They represent the accumulated advice (aka wisdom) of many experienced devs. – pm100, Jan 04 '19 at 00:36

Bobrovsky · Answer 1 · 2020-08-07T11:33:44.067

3

You might try the Docotic.Pdf library for your task.

The library can extract text from PDFs (with or without formatting).

Alternatively, you can retrieve a collection of words with their bounding rectangles from a PDF. This should help you find the location of the text in a file.
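Whichever library does the extraction, the splitting decision itself is small. Here is a minimal sketch, assuming the text of each page has already been extracted into a list of strings. The `SplitPlanner` class, the `FindSplitRanges` method and the marker string are hypothetical names for illustration and are not part of Docotic.Pdf or easyPDF.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of any PDF library.
public static class SplitPlanner
{
    // pageTexts[i] is assumed to hold the extracted text of page i + 1.
    // Returns 1-based, inclusive (first, last) page ranges. A new range
    // starts on every page (after the first) whose text contains the marker.
    public static List<Tuple<int, int>> FindSplitRanges(
        IList<string> pageTexts, string marker)
    {
        var ranges = new List<Tuple<int, int>>();
        if (pageTexts.Count == 0)
        {
            return ranges;
        }

        int start = 1;
        for (int page = 2; page <= pageTexts.Count; page++)
        {
            // Treat a missing page text as empty so the search never throws.
            string text = pageTexts[page - 1] ?? string.Empty;
            bool startsNewSection =
                text.IndexOf(marker, StringComparison.Ordinal) >= 0;

            if (startsNewSection)
            {
                // Close the range that ends on the previous page.
                ranges.Add(Tuple.Create(start, page - 1));
                start = page;
            }
        }

        // Close the final range, which runs to the last page.
        ranges.Add(Tuple.Create(start, pageTexts.Count));
        return ranges;
    }
}
```

Each resulting range could then be handed to a page-number-based split, such as the one the question says easyPDF already offers. Matching a plain substring per page is a simplification: a marker that spans a line or page break would need extra handling.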

Disclaimer: I work for the vendor of the library.

edited Aug 07 '20 at 11:33

answered May 04 '12 at 15:45

Bobrovsky


NOTE: As Bobrovsky mentions, this is a commercial product. Its price is non-trivial (though appropriate for what it does). – ToolmakerSteve Jan 04 '19 at 00:23

score 2 · Answer 2 · edited Jan 04 '19 at 00:24


You need a PDF library for .NET, such as iText.Net.

edited Jan 04 '19 at 00:24

ToolmakerSteve


answered May 03 '12 at 18:23

Pablo Santa Cruz


score 1 · Answer 3 · edited May 23 '17 at 11:46


Take a look at this question; it links to some libraries that may satisfy your requirements:

How to programatically search a PDF document in c#

edited May 23 '17 at 11:46

Community


answered May 03 '12 at 18:22

Brian

