I want to split the COSString operands of TJ/Tj operators using PDFBox.

The current content stream of my PDF looks like this:

enter image description here

Desired output

enter image description here

or

enter image description here

What I tried:

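    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.List;

    import org.apache.pdfbox.contentstream.operator.Operator;
    import org.apache.pdfbox.cos.COSArray;
    import org.apache.pdfbox.cos.COSFloat;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdfparser.PDFStreamParser;
    import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.common.PDStream;

    public static void SplitTj_TJ(int tj_ind, PDDocument document) throws IOException {
        PDPage page = document.getPage(0);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();

        // Check the token type before casting; only TJ/Tj operators are handled.
        Object token = tokens.get(tj_ind);
        if (!(token instanceof Operator)) {
            return;
        }
        String name = ((Operator) token).getName();
        // Only TJ takes a COSArray operand; a Tj operand is a single COSString.
        if (!(name.equals("TJ") || name.equals("Tj")) || !(tokens.get(tj_ind - 1) instanceof COSArray)) {
            return;
        }
        COSArray tj_array = (COSArray) tokens.get(tj_ind - 1);

        // Offsets hard-coded for this sample stream.
        COSFloat dest_x = new COSFloat(90.81199646f);
        COSFloat dest_y = new COSFloat(0f);

        // Replace "[...] TJ" with "<string> Tj  dest_x dest_y Td  <string> Tj".
        tokens.remove(tj_ind);
        tokens.remove(tj_ind - 1);
        tokens.add(tj_ind - 1, tj_array.get(0));
        tokens.add(tj_ind, Operator.getOperator("Tj"));
        tj_array.remove(0);
        tokens.add(tj_ind + 1, dest_x);
        tokens.add(tj_ind + 2, dest_y);
        tokens.add(tj_ind + 3, Operator.getOperator("Td"));
        tokens.add(tj_ind + 4, tj_array.get(1)); // index 1 because element 0 was removed above
        tokens.add(tj_ind + 5, Operator.getOperator("Tj"));
        // Hard-coded for this sample: overwrite a later operand with the negated offset.
        tokens.set(tj_ind + 9, new COSFloat(-90.81199646f));

        // Write the modified tokens into a new content stream for the page.
        PDStream newContents = new PDStream(document);
        try (OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE)) {
            new ContentStreamWriter(out).writeTokens(tokens);
        }
        System.out.println("Count at end: " + tokens.size());
        page.setContents(newContents);

        // Save the modified page as a new single-page document.
        try (PDDocument pdf = new PDDocument()) {
            pdf.addPage(page);
            pdf.save("D:/Testfiles/brigs11.pdf");
        }
    }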

I am not sure this will work for all cases, since the indices and offsets are hard-coded for this one stream. What would generic code for this look like?

How can I achieve this using PDFBox? I want to be able to split every TJ/Tj, whichever text positioning operators surround it, without messing up the existing stream.

  • What have you tried yet? It is not trivial as you have to keep track of the text state along the whole stream. I would base a solution on the `PdfContentStreamEditor` from [this answer](https://stackoverflow.com/questions/58475104/filter-out-all-text-above-a-certain-font-size-from-pdf) which keeps track of the graphics state, but there still is some work to do. – mkl May 21 '20 at 11:57
  • I added some sample code. Please help me with the case shown above. – fascinating coder May 22 '20 at 01:01
  • @mkl Did you get a chance to look into this? It would be very helpful. – fascinating coder May 26 '20 at 12:05
  • @mkl Any update on this? At least give me some clue (examples of splitting TJ operators) as to how I can achieve it. I will try it and let you know. – fascinating coder Jul 06 '20 at 05:25
  • The problem is that this is quite a lot more complex than your code. In particular you need the current text state to properly implement the replacement you're after. To keep track of that you must keep track of it while going through those tokens, including a stack of saved text states. Implementing that as you do (i.e. walking the list of tokens in a content stream manually) is quite an act. I would propose instead using the `PdfContentStreamEditor` mentioned in my first comment or some similar class doing the heavy lifting for you. ... – mkl Jul 06 '20 at 12:48
  • One question, though, what do you really want to achieve? In particular, if it sufficed to replace `[(Chapter)-375(12)]TJ` by `(Chapter)Tj [-375]TJ (12)Tj` and you didn't need to make that `[-375]TJ` some `Td` operation, this would be much easier... – mkl Jul 06 '20 at 13:01
  • Yes, `[-375]TJ` is not required in this case. Let me try and get back to you. – fascinating coder Jul 07 '20 at 09:56
  • *"Let me try and get back to you."* - any news on this? – mkl Feb 15 '21 at 19:26
  • Yes, I am able to split it. Based on the positions from [this question](https://stackoverflow.com/questions/61913756/print-the-positions-of-each-tj-and-charecters-inside-tj-tj-in-a-pdf-using-pdfbox), I am able to alter the content stream tokens and split them. – fascinating coder Feb 17 '21 at 06:09
  • Would it be ok for you if we closed this question as duplicate of the other? Or would you prefer writing an answer and explaining how you used those techniques for solving this question? – mkl Feb 17 '21 at 07:51
  • It's not a duplicate; the two questions are different. The link above only helps with doing this. – fascinating coder Feb 18 '21 at 05:50
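
The simpler replacement mentioned in the comments, turning `[(Chapter)-375(12)]TJ` into `(Chapter)Tj [-375]TJ (12)Tj`, needs no text state tracking because every number stays a `TJ` adjustment instead of becoming a `Td`. Below is a minimal sketch of that token rewrite in plain Java. It deliberately uses stand-in types rather than PDFBox classes: `TjSplitSketch`, `Op`, `splitShowTextArrays` and `render` are hypothetical names, Java `String`, `Number` and `List` stand in for what PDFBox would hand you as `COSString`, `COSNumber` and `COSArray`, and `Op` stands in for `Operator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class TjSplitSketch {

    /** Hypothetical stand-in for a content stream operator such as TJ or Tj. */
    static final class Op {
        final String name;
        Op(String name) { this.name = name; }
        @Override public String toString() { return name; }
    }

    /**
     * Rewrites every "[ ... ] TJ" so that each string is shown by its own Tj
     * and each numeric adjustment is kept as a one-element "[n] TJ".
     * All other tokens are copied through unchanged.
     */
    static List<Object> splitShowTextArrays(List<Object> tokens) {
        List<Object> out = new ArrayList<>();
        for (Object token : tokens) {
            boolean isTJ = token instanceof Op && ((Op) token).name.equals("TJ");
            Object operand = out.isEmpty() ? null : out.get(out.size() - 1);
            if (isTJ && operand instanceof List) {
                out.remove(out.size() - 1);               // drop the original array operand
                for (Object element : (List<?>) operand) {
                    if (element instanceof Number) {
                        out.add(Collections.<Object>singletonList(element));
                        out.add(new Op("TJ"));            // adjustment stays a TJ
                    } else {
                        out.add(element);
                        out.add(new Op("Tj"));            // string gets its own Tj
                    }
                }
            } else {
                out.add(token);
            }
        }
        return out;
    }

    /** Renders tokens in a PDF-like notation, separated by single spaces. */
    static String render(List<?> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Object t : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(renderToken(t));
        }
        return sb.toString();
    }

    static String renderToken(Object t) {
        if (t instanceof String) return "(" + t + ")";
        if (t instanceof List) return "[" + render((List<?>) t) + "]";
        return String.valueOf(t);                         // numbers and operators
    }

    public static void main(String[] args) {
        List<Object> tokens = new ArrayList<>();
        tokens.add(Arrays.<Object>asList("Chapter", -375, "12"));
        tokens.add(new Op("TJ"));
        System.out.println(render(splitShowTextArrays(tokens)));
        // prints: (Chapter) Tj [-375] TJ (12) Tj
    }
}
```

With PDFBox the same loop would presumably run over the list returned by `PDFStreamParser.getTokens()` and the result would be written back with `ContentStreamWriter.writeTokens`, as in the question's code. Converting the adjustments into `Td` operations instead is the harder variant, since it requires the current font size and text state.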
