How to detect newline from PDF using iTextSharp

Question

I use getbaseline[Vector.I2] to detect subscript and superscript. With this approach I am not able to detect newlines in the PDF. Can you please suggest how to detect newlines in a PDF using iTextSharp?

In essence you have to recognize small vertical differences as subscript and superscript, and larger ones as newlines. PDFs which use text rise operators for subscript and superscript make this easier still. – mkl, Mar 16 '13 at 10:57
Please describe what you have done so far in more detail, ideally with some code. As you say you already use the base line for subscript and superscript detection, you already seem to be halfway there. – mkl, Mar 16 '13 at 12:02

mkl · Accepted Answer · 2013-03-18T11:25:57.837

The code you supplied isn't completely self-explanatory. Thus I make some assumptions, foremost that your code is some excerpt of the RenderText(TextRenderInfo) method of a RenderListener implementation, probably some extension of the SimpleTextExtractionStrategy with added member variables lastBaseLine, firstcharacter_baseline, lastFontSize, and lastFont.

This implies that you only seem to be interested in documents in which text occurs in the content stream in reading order; otherwise you would have based your code on the LocationTextExtractionStrategy or a similar base algorithm.

Furthermore I don't understand some of your if statements, which are either always false or always true, or whose body is empty. Nor is it clear what text_second is good for, or why you calculate difference = curBaseline[Vector.I2] - curBaseline[Vector.I2] in one place, which is always zero.

All this being said, your initial if statement seems to test whether or not the vertical base line position of the new text is different from that of the text before. Thus, this is where you could also spot the start of a new line.

I would propose that you start storing not only the last base line but also the last descent line, which according to the docs is "the line that represents the bottom most extent that a string of the current font could have", and compare it with the current ascent line (by the docs "the line that represents the topmost extent that a string of the current font could have").

If the ascent line of the current text is below the descent line of the last text, we should have a new line: the text is too far down for a subscript. In code:

[...]
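else if (curBaseline[Vector.I2] < lastBaseLine[Vector.I2])
{
    // The new text starts below the previous text.
    if (curAscentLine[Vector.I2] < lastDescentLine[Vector.I2])
    {
        // Even the top of the new text is below the bottom of the previous
        // text: too far down for a subscript, so treat it as a new line.
        firstcharacter_baseline = character_baseline;
        this.result.Append("<br/>");
    }
    else
    {
        // Small downward shift: keep the existing subscript handling.
        difference = firstcharacter_baseline - curBaseline[Vector.I2];
        text_second.SetTextRise(difference);

        if (difference != 0)
        {
            SupSubFlag = 2;
        }
    }
}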
[...]
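
The comparison above can also be tried out in isolation. The following is a minimal sketch, not the asker's code and not part of iTextSharp: the names VerticalMove and LineBreakClassifier are hypothetical, and the arguments are plain y coordinates, which in a render listener would presumably be read from the start points of the base, ascent and descent lines via Vector.I2.

```csharp
using System;

// Hypothetical helper types, not part of iTextSharp.
public enum VerticalMove { SameLine, Raised, Lowered, NewLine }

public static class LineBreakClassifier
{
    // All arguments are y coordinates (Vector.I2 in iTextSharp terms).
    // lastBaseline / lastDescent describe the previous text chunk,
    // curBaseline / curAscent describe the chunk being rendered now.
    public static VerticalMove Classify(float lastBaseline, float lastDescent,
                                        float curBaseline, float curAscent)
    {
        if (curBaseline == lastBaseline)
        {
            return VerticalMove.SameLine;
        }
        if (curBaseline > lastBaseline)
        {
            // Moved up: a superscript candidate, or the return from a subscript.
            return VerticalMove.Raised;
        }
        // Moved down: it is a new line only if even the top of the new text
        // lies below the bottom of the previous text.
        return curAscent < lastDescent ? VerticalMove.NewLine : VerticalMove.Lowered;
    }
}
```

With a previous chunk at base line 700 and descent line 697, a chunk at base line 686 with ascent line 695 is classified as NewLine, while a chunk at base line 697 with ascent line 703 is only Lowered, i.e. a subscript candidate.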

As you expect the text in the content stream to occur in reading order, you can also try to recognize a new line by comparing the Vector.I1 coordinates of the end of the base line of the last text and the start of the base line of the new text. If the new one is significantly smaller than the old one, this looks like a carriage return, hinting at a new line.
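
As a rough sketch of this horizontal test, assuming the two x coordinates have already been read from the end point of the last base line and the start point of the new base line (Vector.I1): the class name and the tolerance parameter are hypothetical, and a sensible tolerance value depends on the document, for example some multiple of the font size.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class CarriageReturnDetector
{
    // lastEndX:  x coordinate where the previous text chunk's base line ends.
    // curStartX: x coordinate where the new text chunk's base line starts.
    // tolerance: how far back (in user space units) the new text may start
    //            before the move counts as a carriage return; small backward
    //            moves, e.g. from kerning, should stay below it.
    public static bool LooksLikeCarriageReturn(float lastEndX, float curStartX,
                                               float tolerance)
    {
        return curStartX < lastEndX - tolerance;
    }
}
```

For example, a chunk ending at x = 520 followed by one starting at x = 72 is reported as a carriage return with a tolerance of 20, whereas a chunk starting at x = 518 is not.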

The code, of course, will run into trouble in a number of situations:

- Whenever your expectation that the text in the content stream occurs in reading order is not fulfilled, you'll get garbage all over.
- When you have multi-column text, the test above won't catch the line break between the bottom of one column and the top of the next. To also catch this, you might want to check (analogously to the proposed check for a jump a line down) whether the new text is way above the last text, comparing the last ascent line with the new descent line.
- If you get PDFs with very densely packed text, lines might overlap with superscript and subscript of surrounding lines. In this case you will have to fine-tune the comparisons. But here you will definitely run into falsely detected breaks sometimes.
- If you get PDFs with rotated text, you'll get garbage all over.
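
The upward check for multi-column text mentioned in the list might look like the following sketch. It is only an illustration under assumptions: the class name and the minGap parameter are hypothetical, the arguments are plain y coordinates (Vector.I2), and minGap is one possible knob for the fine-tuning that densely packed text requires.

```csharp
using System;

// Hypothetical helper, not part of iTextSharp.
public static class ColumnJumpDetector
{
    // lastAscent: y coordinate of the previous text chunk's ascent line.
    // curDescent: y coordinate of the new text chunk's descent line.
    // minGap:     minimum vertical distance required between the two before
    //             the move counts as a jump to the top of another column;
    //             0 means any clear separation is enough.
    public static bool LooksLikeJumpUp(float lastAscent, float curDescent,
                                       float minGap)
    {
        // The bottom of the new text lies above the top of the previous text:
        // too far up for a superscript.
        return curDescent - lastAscent > minGap;
    }
}
```

A chunk near the bottom of a page with ascent line at y = 81 followed by a chunk with descent line at y = 757 would be reported as a jump, while a superscript whose descent line at y = 80 still overlaps the previous text would not.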

Do I need to change the code that detects subscripts in the PDF? Is there any other method to detect newlines? Please suggest one. – Pragya, Mar 19 '13 at 03:51
Please look at the code in my answer. It's meant to replace just one of your `else if` blocks, and it still contains the `SupSubFlag = 2` line. Thus, if you replace that code block correctly, your subscript detection still functions. Only the case of a large vertical jump is treated differently now. – mkl, Mar 19 '13 at 05:41

score 0 · Answer 2 · answered Mar 16 '13 at 05:46

You can use

document.Add(new Phrase(Environment.NewLine));

OR

  // add line below title
  LineSeparator line = new LineSeparator(1f, 100f, BaseColor.BLACK, Element.ALIGN_CENTER, -1);
  document.Add(new Chunk(line));

Amol

Thank you for the quick reply. When I read text from the PDF, it is extracted without newlines: the contents of the next line are displayed on the same line. I think the above code is for creating a PDF. When getting content from a PDF I want the text to be separated by line breaks. Can anyone please help me out? Thank you in advance. – Pragya Mar 16 '13 at 06:37
See [this](http://bytescout.com/products/developer/pdfextractorsdk/find-text-and-get-coordinates-pdf#aspnet) – Amol Mar 16 '13 at 06:52
