How to get text with a certain color from a PDF in C#

Question

I have to load the data from a PDF file into a particular database structure, which means I need to extract specific pieces of data from the PDF. Since PDF has no tags or similar markup, I was wondering whether it is possible to select text by color, for example all the red text, or by style, for example all the italic text in the document. Is this possible in C#? Or is there another way to easily filter data in a PDF document?

iText pdf, but haven't found the functionality I'm looking for. So I'm open to any suggestions regarding libraries etc. - Olivier_s_j, May 03 '11 at 15:45

score 1 · Accepted Answer · answered May 04 '11 at 17:12

I took a different approach: I converted the PDF to an Excel file, which made it very easy to search for the coloured text.

Olivier_s_j

score 0 · Answer 2 · answered May 03 '11 at 16:14

With this library, http://www.codeproject.com/KB/files/xpdf_csharp.aspx?msg=3154408, you have access to the style of every word (font, color, ...), for example:

this.pdfDoc.Pages[4].WordList.ElementAt(143).ForeColor
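
Building on that expression, collecting every word of a given colour might look like the sketch below. The `Word` and `Page` types are hypothetical stand-ins, defined only so the example compiles on its own; beyond `Pages`, `WordList` and `ForeColor` as shown above, nothing about the real library's types is confirmed here, including whether `ForeColor` is a `System.Drawing.Color`.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

// Hypothetical stand-ins for the library's types (assumption, not the real API).
public sealed record Word(string Text, Color ForeColor);
public sealed record Page(IReadOnlyList<Word> WordList);

public static class ColorFilter
{
    // Returns the text of every word whose colour matches the target.
    // ToArgb() is compared because Color equality also takes the colour's
    // name into account, so Color.Red != Color.FromArgb(255, 0, 0).
    public static List<string> WordsWithColor(IEnumerable<Page> pages, Color target) =>
        pages.SelectMany(p => p.WordList)
             .Where(w => w.ForeColor.ToArgb() == target.ToArgb())
             .Select(w => w.Text)
             .ToList();
}

public static class Program
{
    public static void Main()
    {
        // Sample data standing in for a parsed document.
        var pages = new[]
        {
            new Page(new[]
            {
                new Word("Invoice", Color.Black),
                new Word("OVERDUE", Color.FromArgb(255, 0, 0)),
            }),
        };

        foreach (var text in ColorFilter.WordsWithColor(pages, Color.Red))
            Console.WriteLine(text);
    }
}
```

With the real library, the `pages` argument would presumably be `pdfDoc.Pages`, assuming its word objects expose a comparable colour value.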

anth

score 0 · Answer 3 · answered May 03 '11 at 18:13

iText's PdfTextExtractor (and all the code it rests on) DOES NOT track the current color. Ouch. It wouldn't be all that hard to add, so you could modify iText yourself:

1. Add stroke and fill color members to the GraphicState class (and update the various constructors appropriately).
2. Add ContentOperator classes for 'g', 'G', 'rg', 'RG', 'K', and 'k' (and maybe CS, cs, SC, sc, SCN, scn) to modify the stroke and fill colors.
3. Add methods to TextRenderInfo to get the current stroke and fill colors.
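
To illustrate the idea behind these steps without touching iText itself, here is a minimal standalone sketch of colour tracking. It is not iText code: `ColorState` and `ColorOperators` are hypothetical names, the content stream is assumed to be already split into tokens, the CMYK conversion is a naive approximation, and the colour-space operators (CS, cs, SC, sc, SCN, scn) are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Simplified stand-in for a graphics state; this is not iText's GraphicState.
public sealed class ColorState
{
    // PDF's initial colour is black for both filling and stroking.
    public (double R, double G, double B) Fill = (0, 0, 0);
    public (double R, double G, double B) Stroke = (0, 0, 0);
}

public static class ColorOperators
{
    // Naive device-colour conversions to RGB; colour profiles are ignored.
    private static (double, double, double) Gray(double g) => (g, g, g);

    private static (double, double, double) Cmyk(double c, double m, double y, double k) =>
        ((1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k));

    // Applies a colour operator to the state. Lowercase operators set the fill
    // colour, uppercase ones the stroke colour. Returns false for anything else.
    public static bool Apply(string op, IReadOnlyList<double> a, ColorState s)
    {
        switch (op)
        {
            case "g" when a.Count == 1: s.Fill = Gray(a[0]); return true;
            case "G" when a.Count == 1: s.Stroke = Gray(a[0]); return true;
            case "rg" when a.Count == 3: s.Fill = (a[0], a[1], a[2]); return true;
            case "RG" when a.Count == 3: s.Stroke = (a[0], a[1], a[2]); return true;
            case "k" when a.Count == 4: s.Fill = Cmyk(a[0], a[1], a[2], a[3]); return true;
            case "K" when a.Count == 4: s.Stroke = Cmyk(a[0], a[1], a[2], a[3]); return true;
            default: return false;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // A pre-tokenized fragment: red fill, show "Total", black fill, show "due".
        var tokens = new[] { "1", "0", "0", "rg", "(Total)", "Tj", "0", "g", "(due)", "Tj" };
        var state = new ColorState();
        var operands = new List<double>();
        var lastString = "";

        foreach (var t in tokens)
        {
            if (double.TryParse(t, NumberStyles.Float, CultureInfo.InvariantCulture, out var n))
                operands.Add(n);                       // numeric operand
            else if (t.StartsWith("("))
                lastString = t.Trim('(', ')');         // string operand
            else
            {
                // In the default rendering mode, text is painted with the fill colour.
                if (t == "Tj") Console.WriteLine($"{lastString}: fill={state.Fill}");
                else ColorOperators.Apply(t, operands, state);
                operands.Clear();                      // operands are consumed per operator
            }
        }
    }
}
```

In a modified iText, the same bookkeeping would live in the graphics state and be exposed through the text render info, so that a listener can compare each text chunk's fill colour against the colour it is looking for.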

score 0 · Answer 4 · answered May 03 '11 at 19:35

Try PDFlib TET: http://www.pdflib.com/products/tet/
It should be able to give you information about the text.

Fabrizio Accatino
