I'm working on a PDF reader, and I want to distinguish between a real new line and a line break that falls inside a paragraph (caused by running out of space on the line). The problem is that even when the next line belongs to the same paragraph, the extracted text contains an `\n`.

Here is the code I have tried so far.

    public string GetContent(int page = 1)
    {
        using (var pdfReader = new PdfReader(Path))
        {
            ITextExtractionStrategy strategy = new LocationTextExtractionStrategy();
            //ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

            //iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(0, 0, 612, 792);
            //RenderFilter[] renderFilter = new RenderFilter[1];
            //renderFilter[0] = new RegionTextRenderFilter(rect);
            //ITextExtractionStrategy textExtractionStrategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);

            // Use the requested page instead of always reading page 1.
            var currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            currentText =
                Encoding.UTF8.GetString(Encoding.Convert(
                    Encoding.Default,
                    Encoding.UTF8,
                    Encoding.Default.GetBytes(currentText)));

            return currentText;
        }
    }
NinjaOnSafari
  • Unless your PDF is a tagged PDF, there are no paragraphs in your PDF document. You may see paragraphs with your human eyes, but to a machine, there are only lines, lines and lines. A machine doesn't know which line is a title, which line belongs to a paragraph and which line doesn't belong to a paragraph. Read more about this in [this article](http://www.openhealthnews.com/articles/2014/using-open-source-pdf-technology-solve-unstructured-data-problem-healthcare). (In short: your question is wrong because you are making wrong assumptions.) – Bruno Lowagie Aug 24 '15 at 13:13
  • @BrunoLowagie so you're saying there is no way to get this information? – NinjaOnSafari Aug 24 '15 at 13:25
  • You could measure the length of each line and assume that short lines indicate the end of a paragraph. This is demonstrated in [this video](https://www.youtube.com/watch?v=lZnbhnU4m3Y). – Bruno Lowagie Aug 24 '15 at 13:54
  • Not related to your actual question but you should just completely remove the `currentText = Encoding.UTF8...` line because it doesn't do what you want it to do. See [this](http://stackoverflow.com/a/10191879/231316) for more on that. – Chris Haas Aug 24 '15 at 14:35
  • @BrunoLowagie do you know whether the source code from those examples/demos was made public? – NinjaOnSafari Aug 24 '15 at 15:13
  • It was a project for a customer, hence: no, the source code isn't public. – Bruno Lowagie Aug 24 '15 at 16:07
  • Is there any documentation or are there examples for this topic elsewhere? – NinjaOnSafari Aug 25 '15 at 08:32
  • @BrunoLowagie are you familiar with the code? – NinjaOnSafari Aug 25 '15 at 08:46
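
The line-length idea suggested in the comments (treat short lines as the end of a paragraph) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the video: `ParagraphHeuristic`, `SplitIntoParagraphs` and `shortLineRatio` are hypothetical names, not part of iTextSharp, and the character count of a line is used as a crude stand-in for its rendered width. It works on the string already returned by `PdfTextExtractor.GetTextFromPage`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class ParagraphHeuristic
{
    // Assumption: a line noticeably shorter than the longest line on the page
    // ends a paragraph; every other '\n' is treated as a soft wrap.
    public static List<string> SplitIntoParagraphs(string pageText, double shortLineRatio = 0.75)
    {
        // Trim() also removes a trailing '\r'; empty lines are dropped.
        var lines = pageText
            .Split('\n')
            .Select(l => l.Trim())
            .Where(l => l.Length > 0)
            .ToList();

        var paragraphs = new List<string>();
        if (lines.Count == 0) return paragraphs;

        // Character count is only a rough proxy for the rendered line width.
        int longest = lines.Max(l => l.Length);
        var current = new StringBuilder();

        foreach (var line in lines)
        {
            // Join wrapped lines of the same paragraph with a single space.
            if (current.Length > 0) current.Append(' ');
            current.Append(line);

            // A short line closes the current paragraph.
            if (line.Length < longest * shortLineRatio)
            {
                paragraphs.Add(current.ToString());
                current.Clear();
            }
        }

        // Keep whatever is left if the page ends on a full-width line.
        if (current.Length > 0) paragraphs.Add(current.ToString());
        return paragraphs;
    }
}
```

It could be called as `ParagraphHeuristic.SplitIntoParagraphs(currentText)` on the text of one page. The heuristic will misfire on headings, lists and paragraphs whose last line happens to fill the whole width, so the ratio probably needs tuning per document.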
