I have to load the data from a PDF file into a certain database structure, which means I need to extract specific data from the PDF. Since PDF has no tags or similar markup, I was wondering whether it is possible to select text based on its color, for example all the red text, or all the italic text in the document. Is this possible in C#? Or is there another way to easily filter data in a PDF document?
4 Answers
1
I've taken a different approach: I converted the PDF to an Excel file, which made it very easy to search for the coloured text.

Olivier_s_j
0
Using this library, http://www.codeproject.com/KB/files/xpdf_csharp.aspx?msg=3154408, you have access to the style of every word (font, color, ...), for example:
this.pdfDoc.Pages[4].WordList.ElementAt(143).ForeColor

anth
0
iText's PdfTextExtractor (and all the code it rests on) DOES NOT track the current color. Ouch. It wouldn't be all that hard to add, so you could modify iText yourself:
- Add stroke and fill color members to the GraphicState class (and update the various constructors appropriately).
- You'd need to add ContentOperator classes for 'g', 'G', 'rg', 'RG', 'K', and 'k' (and maybe CS, cs, SC, sc, SCN, scn) to modify the stroke and fill colors.
- Add methods to TextRenderInfo to get the current stroke and fill colors.
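
To make the operator step more concrete, here is a rough, standalone sketch of what those color operators would have to do. It does not use or modify any iText classes: `ColorState` and `ColorOperators` are hypothetical names invented for this illustration, and the CMYK conversion is a naive approximation that ignores color profiles. In the PDF content stream syntax, the lowercase operators ('g', 'rg', 'k') set the fill color and the uppercase ones ('G', 'RG', 'K') set the stroke color. Saving and restoring state ('q'/'Q') and the color-space operators (CS, cs, SC, sc, SCN, scn) are left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the color members a graphics state would carry.
// Colors are stored as RGB components in the range 0..1.
public sealed class ColorState
{
    public double[] Fill = { 0.0, 0.0, 0.0 };   // PDF's initial fill color is black
    public double[] Stroke = { 0.0, 0.0, 0.0 }; // PDF's initial stroke color is black
}

// Hypothetical helper, not part of iText.
public static class ColorOperators
{
    // Applies one color operator to the state.
    // Returns false for operators this sketch does not handle.
    public static bool Apply(ColorState state, string op, IReadOnlyList<double> operands)
    {
        switch (op)
        {
            case "g":  state.Fill = FromGray(operands);   return true;
            case "G":  state.Stroke = FromGray(operands); return true;
            case "rg": state.Fill = FromRgb(operands);    return true;
            case "RG": state.Stroke = FromRgb(operands);  return true;
            case "k":  state.Fill = FromCmyk(operands);   return true;
            case "K":  state.Stroke = FromCmyk(operands); return true;
            default:   return false;
        }
    }

    private static double[] FromGray(IReadOnlyList<double> o)
    {
        Require(o, 1);
        return new[] { o[0], o[0], o[0] };
    }

    private static double[] FromRgb(IReadOnlyList<double> o)
    {
        Require(o, 3);
        return new[] { o[0], o[1], o[2] };
    }

    // Naive CMYK to RGB conversion (assumption: no ICC profile handling).
    private static double[] FromCmyk(IReadOnlyList<double> o)
    {
        Require(o, 4);
        double k = 1.0 - o[3];
        return new[] { (1.0 - o[0]) * k, (1.0 - o[1]) * k, (1.0 - o[2]) * k };
    }

    private static void Require(IReadOnlyList<double> o, int count)
    {
        if (o == null || o.Count != count)
            throw new ArgumentException("Expected " + count + " operand(s).");
    }
}
```

With this sketch, a content stream fragment such as `1 0 0 rg` would correspond to `ColorOperators.Apply(state, "rg", new double[] { 1, 0, 0 })`, leaving `state.Fill` at (1, 0, 0), i.e. red. A real change to iText would then expose that value on each piece of extracted text, as the last bullet describes.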

Mark Storer
0
Try PDFlib TET: http://www.pdflib.com/products/tet/
It should be able to give you information about the text.

Fabrizio Accatino