I am working on converting PDF to text. I can extract text from the PDF correctly, but it gets complicated with table structures. I know PDF doesn't support table structure, but I think there must be a way to get the cells correctly. For example:

I want to convert to text like this:

> This is first example.

> This is second example.

But when I convert the PDF to text, the data looks like this:

> This is This is

> first example. second example.

How can I get values correctly?

--EDIT:

Here is how I convert the PDF to text:

    // Requires: using System.Text; using System.Windows.Forms;
    //           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    OpenFileDialog ofd = new OpenFileDialog();
    string filepath;
    ofd.Filter = "PDF Files(*.PDF)|*.PDF|All Files(*.*)|*.*";

    if (ofd.ShowDialog() == DialogResult.OK)
    {
        filepath = ofd.FileName;

        string strText = string.Empty;
        try
        {
            PdfReader reader = new PdfReader(filepath);

            // Page numbers are 1-based; use <= so the last page is not skipped.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
                string s = PdfTextExtractor.GetTextFromPage(reader, page, its);

                // Round trip through Encoding.Default: not required, since s is already
                // a .NET string, and it can lose characters that encoding cannot represent.
                s = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(s)));
                strText += s;
            }
            reader.Close();
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
    }
pseudocode
  • Can you please share the code you are using to retrieve that text? – Bassie Dec 02 '16 at 10:18
  • @Bassie Thanks, I updated my post. – pseudocode Dec 02 '16 at 10:24
  • Doesn't look like this is possible by default, check this for a possible solution: http://stackoverflow.com/questions/7513209/using-locationtextextractionstrategy-in-itextsharp-for-text-coordinate/7515625#7515625 – Bassie Dec 02 '16 at 11:00
  • Are you able to provide a sample pdf? – Bassie Dec 02 '16 at 11:06
  • You use the `LocationTextExtractionStrategy` which arranges all text it finds in left-to-right lines from top to bottom. You will need something different here. Depending on your PDFs the `SimpleTextExtractionStrategy` might do. – mkl Dec 02 '16 at 11:52
  • @mkl Thanks for answer, I changed LocationTextExtractionStrategy with SimpleTextExtractionStrategy, then it worked. – pseudocode Dec 05 '16 at 05:13
  • Please be aware, this is not a solution for all documents. The `SimpleTextExtractionStrategy` simply takes the strings in the order they are drawn. This *can* be the desired order but it also can appear completely random. – mkl Dec 05 '16 at 06:22

1 Answer

To make my comment an actual answer...

You use the LocationTextExtractionStrategy for text extraction:

    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
    string s = PdfTextExtractor.GetTextFromPage(reader, page, its);

This strategy arranges all text it finds in left-to-right lines from top to bottom (actually also taking the text line angle into account). Thus, it clearly is not what you need to extract text from tables with cells with multi-line content.

Depending on the document in question there are different approaches one can take:

  • Use the iText SimpleTextExtractionStrategy if the text drawing operations in the document in question already are in the order one wants for text extraction.
  • Use a custom text extraction strategy which makes use of tagging information if the document tables are properly tagged.
  • Use a complex custom text extraction strategy which tries to get hints from text arrangements, line paths, or background colors to guess the table cell structure and extract text cell by cell.
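As a rough illustration of the "cell by cell" idea in the last option, here is a minimal sketch that assumes the cell rectangles are already known (for example, measured once for a fixed form layout). It does not guess the table structure; it only restricts extraction to one region using iTextSharp 5's `RegionTextRenderFilter` and `FilteredTextRenderListener`. The class name `CellText`, the method `ExtractRegion`, and the coordinates in the usage note are hypothetical.

```csharp
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class CellText
{
    // cell: assumed cell bounds in PDF user-space units
    // (x and y of the lower-left corner, then width and height).
    public static string ExtractRegion(PdfReader reader, int page, System.util.RectangleJ cell)
    {
        // Only text chunks that fall into the rectangle are passed on to the strategy.
        RenderFilter[] filters = { new RegionTextRenderFilter(cell) };

        // Inside a single cell, left-to-right / top-to-bottom ordering is usually what is wanted,
        // so the location-based strategy is used as the delegate here.
        ITextExtractionStrategy strategy =
            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filters);

        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }
}
```

It could be called once per cell, e.g. `CellText.ExtractRegion(reader, 1, new System.util.RectangleJ(36f, 700f, 250f, 60f))`, with made-up coordinates. Text that straddles a cell border may end up in more than one region, so real documents will likely need some tuning.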

In this case, the OP commented that replacing LocationTextExtractionStrategy with SimpleTextExtractionStrategy worked.
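For reference, that swap might look like the following sketch. It assumes iTextSharp 5.x; the class name `PdfText` and the method `ExtractInDrawingOrder` are made up for illustration. As noted in the comments, the result follows the order in which the strings are drawn, which may or may not be the reading order for a given PDF.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PdfText
{
    // Extracts the text of all pages in content-stream order,
    // i.e. the order in which the text drawing operations appear.
    public static string ExtractInDrawingOrder(string path)
    {
        var text = new StringBuilder();
        PdfReader reader = new PdfReader(path);
        try
        {
            // Page numbers are 1-based and NumberOfPages is inclusive.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // A fresh strategy per page, because a strategy accumulates the text it sees.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }
        return text.ToString();
    }
}
```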

mkl
  • how do i get forecolor and backcolor of every row/cell ? Please tell me. – Manish Jain May 19 '19 at 08:48
  • @ManishJain *"how do i get forecolor and backcolor of every row/cell ? Please tell me."* - This is an entirely different question. Thus, please ask it as a stack overflow question in its own right, not a mere comment. But a few words to start with: unless the pdf is accordingly tagged, there is no hint in it that certain areas are cells other than some lines or color filled rectangles or a regular arrangement of text pieces. Thus, please clarify in your question how in your pdfs cells can be recognized. If you don't know technically, please share representative examples. – mkl May 19 '19 at 09:48