I have a PDF that contains text and backgrounds in different colors. How do I identify which colors are used in the PDF, in CMYK or RGB format?

StringBuilder sb_Sourcepdf = new StringBuilder();
PdfReader reader_FirstPdf = new PdfReader(pdf_of_FirstFile);

Document document = new Document();

PDFParser parser = new PDFParser(new FileInputStream(pdf_of_FirstFile));
parser.parse();
PDDocument docum = parser.getPDDocument();

PDFStreamEngine engine = new PDFStreamEngine();

PDPage page = (PDPage)docum.getDocumentCatalog().getAllPages().get(0);

engine.processStream(page, page.findResources(), page.getContents().getStream());
PDGraphicsState graphicState = engine.getGraphicsState();
string colorname = graphicState.getStrokingColor().getColorSpace().getName();
graphicState.getTextState().getFont();
int r = graphicState.getNonStrokingColor().getJavaColor().getRed();
int g = graphicState.getNonStrokingColor().getJavaColor().getGreen();
int b = graphicState.getNonStrokingColor().getJavaColor().getBlue();
int rgb = graphicState.getNonStrokingColor().getJavaColor().getRGB();
float[] cosp = graphicState.getNonStrokingColor().getColorSpaceValue();
PDColorSpace pd = graphicState.getNonStrokingColor().getColorSpace();

string re = graphicState.getStrokingColor().toString();
int rgbcolor = graphicState.getStrokingColor().getJavaColor().getRGB();

float[] components = { java.awt.Color.black.getRed(), java.awt.Color.black.getGreen(), java.awt.Color.black.getBlue() };

float[] colorSpaceValues = graphicState.getStrokingColor().getColorSpaceValue();


foreach (float c in colorSpaceValues)
{
    Debug.WriteLine(c * 255.00);
}

I used PDFBox, but every value I get is 0.0.

Pragya
  • Which version of PDFBox are you using? – pdp Apr 18 '13 at 09:04
  • If you're using PDFBox, why are you tagging the question as 'itextsharp'? – Bruno Lowagie Apr 18 '13 at 09:11
  • @Bruno Lowagie I wanted to know whether this is possible in iTextSharp, because I am not able to get the values using PDFBox. I have used iTextSharp for extracting the text of the PDF. – Pragya Apr 18 '13 at 09:22
  • @Pragya Currently the parser package of iText does ignore text colors. It is moderately easy to extend it to also provide the coloring information. That being said, your PDFBox code seems to inspect the graphics state only at the start or end of the page description (I don't know which state `engine` is in after `engine.processStream` has been called) while you need the state of the moment when the text you want to inspect was rendered. Furthermore you have to take the text render mode into account to see whether stroking color, non-stroking color, both, or neither apply. – mkl Apr 19 '13 at 06:40
  • @mkl Is there any other way to get the color value? – Pragya Apr 22 '13 at 05:08
  • @Pragya Do you mean using PDFBox? I'm sure that after processing the stream PDFBox allows you to iterate the individual elements of it (among them the text strings) and query the graphics state valid when those elements are printed. Or do you mean iText? Of course you are not bound to use the iText parser package as base; it does already do all the heavy lifting, though, so I don't know why you would not want to use it. – mkl Apr 22 '13 at 07:24
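
mkl's comment above points out that the text render mode decides whether the stroking color, the non-stroking color, both, or neither apply to a piece of text. As a minimal sketch of that rule, the helper below maps the render mode numbers 0 to 7 defined by the PDF specification to the colors that are actually painted. The names `TextRenderModes` and `ColorsUsed` are hypothetical and belong to neither PDFBox nor iTextSharp; how the render mode is read from either library is not shown here.

```csharp
using System;

static class TextRenderModes
{
    // Hypothetical helper, not part of PDFBox or iTextSharp.
    // Takes a PDF text rendering mode (the operand of the Tr operator, 0..7)
    // and reports which of the two current colors are painted.
    public static (bool Fill, bool Stroke) ColorsUsed(int renderMode)
    {
        switch (renderMode)
        {
            case 0: case 4: return (true, false);   // fill only (4 also adds the text to the clipping path)
            case 1: case 5: return (false, true);   // stroke only (5 also clips)
            case 2: case 6: return (true, true);    // fill, then stroke (6 also clips)
            case 3: case 7: return (false, false);  // invisible text (7 only clips)
            default: throw new ArgumentOutOfRangeException(nameof(renderMode));
        }
    }
}
```

For ordinary text the mode is 0, so only the non-stroking (fill) color is relevant; the stroking color matters only for outlined text.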

1 Answer

PdfReader reader_FirstPdf = new PdfReader(pdf_of_FirstFile);

for (int i = 1; i <= reader_FirstPdf.NumberOfPages; i++)
{
    // Custom extraction strategy; its RenderText callback is shown below.
    TextWithFont_SourcePdf Sourcepdf = new TextWithFont_SourcePdf();
    text_First_File = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader_FirstPdf, i, Sourcepdf);
}

// Member of TextWithFont_SourcePdf; the parser calls it for each piece of text it renders.
public void RenderText(iTextSharp.text.pdf.parser.TextRenderInfo renderInfo)
{
    // Non-stroking (fill) color in effect when this text was drawn.
    int r = renderInfo.GetColorNonStroke().R;
    int g = renderInfo.GetColorNonStroke().G;
    int b = renderInfo.GetColorNonStroke().B;
}
pdp
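
The answer above reads RGB components, while the question also asks about CMYK. If a color instead arrives as four CMYK components in the range 0 to 1 (assumed here to be what the question's `getColorSpaceValue()` array holds for a DeviceCMYK color), a rough conversion to RGB could look like the sketch below. `NaiveColor` and `CmykToRgb` are hypothetical names, and the formula is the simple device conversion that ignores ICC profiles, so the result only approximates what a PDF viewer displays.

```csharp
using System;

static class NaiveColor
{
    // Hypothetical helper: converts CMYK components (each 0..1) to 8-bit RGB.
    // Naive device conversion; no color management is applied.
    public static (int R, int G, int B) CmykToRgb(float c, float m, float y, float k)
    {
        // Each channel is dimmed by its own ink and by the black component.
        int Scale(float ink) => (int)Math.Round(255 * (1 - ink) * (1 - k));
        return (Scale(c), Scale(m), Scale(y));
    }
}
```

With this sketch, `NaiveColor.CmykToRgb(0f, 1f, 1f, 0f)` gives pure red, and any color with `k = 1` gives black.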