
I am using the logic below to extract text from a PDF using PDFBox. It gives good output for normal PDFs.

PDFTextStripper stripper = new PDFTextStripper();
stripper.setSortByPosition(false);
stripper.setParagraphStart("$");
stripper.setParagraphEnd("$$");
String output = stripper.getText(pdf);

But I have some PDFs in which the text is inclined at an angle, as shown in the attached image. For this type of PDF, PDFBox gives the output below:

$ Image proc $$

$ essing is pr $$

$ ocessing of im $$

$ ages usin $$

$ g mathe $$

$ matical $$.....

I want to get the output as:

$ Image processing is processing of images using 
mathematical...................................................
..........................techniques to the input $$

Please suggest how to get good output from these types of PDFs.

Nicolas Filotto
sagar
  • You can't. Either use OCR, or alter the PDF (prepend a rotation to the content stream) to unslant the text. Of course this makes sense only if you know the exact angle of all these PDFs. – Tilman Hausherr Sep 17 '16 at 12:35
  • @TilmanHausherr I tried to unslant the text. Using PDFBox, I got the angle of orientation for all characters as 0 degrees, so I don't know how to proceed further. – sagar Sep 17 '16 at 12:55
  • Try this for your PDPage: `PDPageContentStream cs = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.PREPEND, true, true); cs.transform(Matrix.getRotateInstance(Math.toRadians(45), 0, 0)); cs.close();` and then save the PDF. Obviously, you have to adjust the angle and maybe use a translation (i.e. change the 0,0). – Tilman Hausherr Sep 17 '16 at 12:59
  • @TilmanHausherr The code works. It rotates the page to the required angle. Currently I am hardcoding the radian value. Is there any method/code to get the actual rotation angle of the text? I have tried a few things. I am able to get the text rotation angle in multiples of 90, but not the exact value. – sagar Sep 19 '16 at 06:00
  • Continuing the above comment: for example, in the attached image the text is inclined at 45 deg, but I am getting 0 deg from the code below: `PDDocument doc = PDDocument.load(pdf); PDFTextStripper stripper = new PDFTextStripper() { @Override protected void processTextPosition(TextPosition text) { System.out.println(text.getDir()); } }; stripper.getText(doc);` – sagar Sep 19 '16 at 06:06
  • I don't have code to find the skew, sorry, this is rather a math / computer vision question. I don't understand the second comment "I am getting 0 deg from the below code", the code does text extraction, nothing about detecting skew degrees. – Tilman Hausherr Sep 19 '16 at 06:30
  • @TilmanHausherr Sorry for the unclear explanation. Yes, the above is for text extraction, but I override the `processTextPosition()` method of the `PDFTextStripper` class to get the angle of every character. `getText()` calls `processTextPosition()` internally for each character, and the override prints that character's rotation angle. For example, for the attached image above, the code prints the angle of rotation for each and every character. – sagar Sep 19 '16 at 09:57
  • oops, yes, `getDir()`, this doesn't seem useful here. You could also try `getTextMatrix().createAffineTransform()`, then get the angle from there (e.g. https://stackoverflow.com/questions/21561909/finding-angle-from-transform-matrix ). But this will work only if the invisible OCR text is rotated too. – Tilman Hausherr Sep 19 '16 at 10:09
  • @TilmanHausherr Thank you. I will try your approach – sagar Sep 20 '16 at 04:47
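
For reference, the last suggestion in the comments (read the angle off the affine transform) could look roughly like the sketch below. It uses only `java.awt.geom.AffineTransform`, so it runs without PDFBox. The class name `SkewAngle` and the method `angleDegrees` are made up for illustration, and the transform built in `main` is only a stand-in for a real text matrix.

```java
import java.awt.geom.AffineTransform;

// Hypothetical helper (not part of PDFBox): recovers the rotation angle
// encoded in an affine transform.
class SkewAngle {

    // Returns the rotation in degrees, in the transform's own coordinate system.
    static double angleDegrees(AffineTransform at) {
        // For a rotation, optionally combined with uniform scaling by s,
        // m00 = s*cos(theta) and m10 = s*sin(theta), so atan2 recovers theta.
        // In AffineTransform, getScaleX() is m00 and getShearY() is m10.
        return Math.toDegrees(Math.atan2(at.getShearY(), at.getScaleX()));
    }

    public static void main(String[] args) {
        // Stand-in for a text matrix: 45 degree rotation plus a 12x scale
        // (the scale is an assumed example value, e.g. a font size).
        AffineTransform at = AffineTransform.getRotateInstance(Math.toRadians(45));
        at.scale(12, 12);
        System.out.println(angleDegrees(at));
    }
}
```

Inside the `processTextPosition()` override, this would presumably be called as `angleDegrees(text.getTextMatrix().createAffineTransform())`. As noted in the comments, it can only report a non-zero angle if the text in the PDF is itself rotated; if the slant exists only in a scanned image, the result will stay at 0.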
