
I have a PDF like this:

pdf

but the extracted text comes out like this:

Pocket,Light
Olive,M,469168485002
1475.63 364.23 1111.40 133.3761051020 2299.0

But I want the result to preserve the multiple spaces present in the document:

Hoodie with Kangaroo
Pocket,Light              61051020     2299.0     1475.63       364.23      1111.40       133.37
Olive,M,469168485002

I expect the extracted text to keep the spacing that is present in the PDF.

Olaf Kock
iiprateek
  • Most likely there are no spaces (as in drawn space characters) there, the text insertion point simply is repositioned. What you are looking for is text extraction that tries to keep the layout. – mkl Apr 06 '23 at 20:07
  • I just want to preserve the format of the PDF using Apache PDFBox with Java. – iiprateek Apr 07 '23 at 17:10
  • Maybe you should use tabula-java instead; it can extract tables. – Tilman Hausherr Apr 07 '23 at 17:21
  • @mkl Yes, each character has its own coordinates, which the PDF renderer uses to draw the text. What algorithm should I use to produce the whitespace so that the gaps show up as spaces? My idea: create a blank string as wide as the PDF page, then calculate where each single character should be placed in it. – iiprateek Apr 09 '23 at 12:58
  • You may want to look at the proof-of-concept [in this answer](https://stackoverflow.com/a/45842515/1729265) (unfortunately based on PDFBox 1.8.x) or [this github project](https://github.com/JonathanLink/PDFLayoutTextStripper) based on PDFBox 2.x. – mkl Apr 13 '23 at 21:15
  • I have used the GitHub project once. If you look carefully at picture no. 2 (Data extraction from a form in a PDF file), the spaces are not formatted correctly: a single space shows up as two whitespace characters in the extracted text. That project first detects new lines based on the Y position: if the Y position of the current `TextPosition` object is greater than that of the previous one, a new line is started. It then creates a blank array whose width is the PDF width / 4 and uses each character's X position to fit the text into the extracted output. It also checks whether XprevChar + prevCharWidth == currXpos. – iiprateek Apr 16 '23 at 17:53
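
The grid idea discussed in the comments above can be sketched roughly as follows. This is only a minimal illustration, not the approach of PDFLayoutTextStripper or of PDFBox itself: `LayoutSketch`, `Glyph`, `charWidth` and `lineTolerance` are hypothetical names, and the average character width is an assumed constant that would need tuning per document. With PDFBox, the positioned text would typically come from overriding `PDFTextStripper.writeString(String, List<TextPosition>)` and reading `getUnicode()`, `getXDirAdj()` and `getYDirAdj()` from each `TextPosition`; here a plain stand-in class is used so the sketch runs without the library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class LayoutSketch {

    /** Hypothetical stand-in for a positioned piece of text (PDFBox would supply a TextPosition). */
    static final class Glyph {
        final String text;
        final float x; // left edge, in PDF units
        final float y; // distance from the top of the page, in PDF units

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Places each glyph on a character grid.
     * charWidth is an assumed average glyph width; lineTolerance is the largest
     * y difference (relative to the first glyph of a line) still treated as the same line.
     */
    static String render(List<Glyph> glyphs, float charWidth, float lineTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        // Group glyphs into lines by their y position.
        List<List<Glyph>> lines = new ArrayList<>();
        float lineY = 0f;
        for (Glyph g : sorted) {
            if (lines.isEmpty() || Math.abs(g.y - lineY) > lineTolerance) {
                lines.add(new ArrayList<>());
                lineY = g.y;
            }
            lines.get(lines.size() - 1).add(g);
        }

        StringBuilder out = new StringBuilder();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder row = new StringBuilder();
            for (Glyph g : line) {
                int column = Math.round(g.x / charWidth);
                // Pad with spaces up to the target column; text is only ever
                // appended, so earlier glyphs are never overwritten.
                while (row.length() < column) {
                    row.append(' ');
                }
                row.append(g.text);
            }
            out.append(row).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("EF", 10f, 22f),
                new Glyph("CD", 30f, 10.4f),
                new Glyph("AB", 0f, 10f));
        System.out.print(render(glyphs, 5f, 2f));
    }
}
```

Because text is only appended and never moved left, a glyph whose computed column is already occupied simply follows the previous one without a gap. A proportional font will still not line up perfectly with a fixed `charWidth`, which may explain doubled spaces like those mentioned in the last comment.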
