-3

With both iTextSharp and PDFBox I am able to extract the text characters, but their alignment is not the same as in the PDF file (left margin, top margin, etc.).

How can I keep the PDF alignment in the TXT file as well?

rajeevkbc
  • The `LayoutTextExtractionStrategy` in [this answer](https://stackoverflow.com/a/46585997/1729265) might give you a bit of what you are looking for. That you will hardly get more should be clear from @Bruno's answer. – mkl Jan 02 '18 at 21:32
  • As you've meanwhile revisited your question, have you had a look at the `LayoutTextExtractionStrategy`? – mkl Jan 04 '18 at 09:55
  • You should accept my answer instead of posting a new, duplicate question that gets closed and deleted. – Bruno Lowagie Jan 05 '18 at 10:09

1 Answer

3

As you've experienced when experimenting with both iText and PdfBox, you are asking something that is impossible because of a mismatch between the way the Portable Document Format defines a layout and the way layout is established in the plain text format.

  • In .txt files, alignment, indentation, spacing, and so on are achieved using whitespace characters, such as spaces ( ), newline characters (\n), and tabs (\t).
  • In .pdf files, single space characters are often used between words, but when more than one space is needed, or when word spacing is optimized for a better reading experience, you'll see that absolute positioning is preferred over space characters. A \n in a content stream isn't perceived as a new line in the content; instead, the concept of a new line exists through new-line operators. The concept of a tab doesn't exist at all in PDF; absolute positioning using (x, y) coordinates is used instead.

Your expectation that a conversion process from PDF to TXT could somehow solve this syntactical mismatch is endearing, but it starts from an assumption that is totally wrong: you'd need absolute positioning functionality in the plain text format, and that functionality simply isn't there. The answer to your question is that there is no answer.
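To make the mismatch concrete, here is a minimal sketch of what any "layout-preserving" extraction has to do: snap absolutely positioned text onto a grid of characters. It does not use the iText or PDFBox API. The `(x, y, text)` chunks, the function name `render_to_grid`, and the `char_width` and `line_height` values are all assumptions made up for the illustration; a real extractor would have to derive them from the fonts in the page.

```python
from typing import Dict, List, Tuple

# Hypothetical input: (x, y, text) with the origin at the bottom-left of the
# page and coordinates in points, roughly what a PDF parser could report.
Chunk = Tuple[float, float, str]


def render_to_grid(chunks: List[Chunk],
                   char_width: float = 6.0,
                   line_height: float = 12.0) -> str:
    """Approximate absolute positions with spaces and newlines (lossy)."""
    if not chunks:
        return ""
    top = max(y for _, y, _ in chunks)
    rows: Dict[int, List[Tuple[int, str]]] = {}
    for x, y, text in chunks:
        row = round((top - y) / line_height)  # PDF y grows upward, text rows grow downward
        col = round(x / char_width)           # assumes every glyph has the same width
        rows.setdefault(row, []).append((col, text))

    lines = []
    for row in range(max(rows) + 1):
        line = ""
        for col, text in sorted(rows.get(row, [])):
            if len(line) < col:
                line = line.ljust(col)        # pad with spaces up to the target column
            elif line:
                line += " "                   # an earlier chunk overran the column: position is lost
            line += text
        lines.append(line)
    return "\n".join(lines)


chunks = [
    (72.0, 700.0, "Name"), (300.0, 700.0, "Total"),
    (72.0, 688.0, "Widget"), (300.0, 688.0, "42"),
]
print(render_to_grid(chunks))
```

The result only lines up when the text is viewed in a monospaced font, and it degrades as soon as the PDF uses proportional fonts, several font sizes, or text that overlaps a column: exactly the information a plain text file cannot carry.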

Bruno Lowagie