6

I am having a problem reading a table from a PDF file. It's a very simple PDF file with some text and a table. The tool I am using is iTextSharp. I know there is no table concept in PDF. After some googling, I found a suggestion that it might be possible using iTextSharp with a custom ITextExtractionStrategy, but I have no idea how to start. Can someone please give me some hints, or a small piece of sample code?

Cheers

kame
Victor
  • As you did not provide a sample PDF, your question can only be answered in general. Thus, if you really only desire to read one specific table (or a specific kind of tables), you might want to provide a sample PDF to get specific answers. – mkl Mar 28 '13 at 11:20
  • See this post and the links within it http://stackoverflow.com/a/7515625/231316 – Chris Haas Mar 28 '13 at 13:02

3 Answers

3

This code reads the contents of a table. All the text values are enclosed in `()Tj` operators, so we look for every such value; you can then do whatever you like with the resulting strings.

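    // Requires: using System; using System.Collections.Generic; using System.Linq;
    //           using System.Text; using iTextSharp.text.pdf;
    string _filePath = @"~\MyPDF.pdf";

    public List<String> Read()
    {
        var pdfReader = new PdfReader(_filePath);
        var pages = new List<String>();

        // iTextSharp page numbers are 1-based, hence i + 1.
        for (int i = 0; i < pdfReader.NumberOfPages; i++)
        {
            // Decode the raw page content stream into a string.
            string textFromPage = Encoding.UTF8.GetString(Encoding.Convert(Encoding.Default, Encoding.UTF8, pdfReader.GetPageContent(i + 1)));

            pages.Add(GetDataConvertedData(textFromPage));
        }

        pdfReader.Close();
        return pages;
    }

    string GetDataConvertedData(string textFromPage)
    {
        // Keep only the content stream lines that show text with the Tj operator.
        var texts = textFromPage.Split(new[] { "\n" }, StringSplitOptions.None)
                                .Where(text => text.Contains("Tj")).ToList();

        // Strip the leading "(" and the trailing ")Tj" or ") Tj", then concatenate the values.
        return texts.Aggregate(string.Empty, (current, t) => current +
                   t.Trim()
                    .TrimStart('(')
                    .TrimEnd('j')
                    .TrimEnd('T')
                    .TrimEnd(' ')
                    .TrimEnd(')'));
    }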
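The trimming above assumes each `Tj` line holds exactly one plain string. As a minimal sketch of a slightly more tolerant variant, a regular expression can pick out every `(...) Tj` operand and undo escaped parentheses. The names `TjExtractor` and `ExtractTjStrings` are hypothetical, the input is assumed to be the already decoded page content string, and the sketch does not handle `TJ` arrays, hex strings, nested unescaped parentheses, or font-specific encodings:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of iTextSharp.
public static class TjExtractor
{
    // Matches a literal string operand followed by the Tj operator, e.g. "(Hello) Tj".
    // Inside the parentheses, a backslash escapes the next character.
    private static readonly Regex TjPattern =
        new Regex(@"\(((?:\\.|[^\\()])*)\)\s*Tj", RegexOptions.Compiled);

    public static List<string> ExtractTjStrings(string pageContent)
    {
        var values = new List<string>();
        foreach (Match match in TjPattern.Matches(pageContent))
        {
            string raw = match.Groups[1].Value;
            // Undo the simple escapes \( \) and \\ only.
            values.Add(Regex.Replace(raw, @"\\([()\\])", "$1"));
        }
        return values;
    }
}
```

Returning a list instead of one concatenated string keeps the individual values apart, which is usually what you want when the values are table cells.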
gustavohenke
1

This code just reads the text of the PDF file. You'll need these using directives:

using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

Both namespaces come from itextsharp.dll.

var pdfReader = new PdfReader(_filePath);

for (int i = 0; i < pdfReader.NumberOfPages; i++)
{
   var locationTextExtractionStrategy = new LocationTextExtractionStrategy();

   string textFromPage = PdfTextExtractor.GetTextFromPage(pdfReader, i + 1, locationTextExtractionStrategy);

   // GetTextFromPage already returns a .NET string, so no re-encoding is needed here.

   //Do Something with the text
}
0

This is a more manual way, but it can be useful.

    /// <summary>
    /// Reads a table from a PDF
    /// </summary>
    /// <param name="pdf">Path to the PDF</param>
    /// <param name="origemXPag1">Start of reading on the X axis for the first page</param>
    /// <param name="origemYPag1">Start of reading on the Y axis for the first page</param>
    /// <param name="linhasPag1">Number of rows on the first page</param>
    /// <param name="origemXOutrasPag">Start of reading on the X axis for the remaining pages</param>
    /// <param name="origemYOutrasPag">Start of reading on the Y axis for the remaining pages</param>
    /// <param name="linhasOutrasPag">Number of rows on the remaining pages</param>
    /// <param name="alturaLinha">Row height</param>
    /// <param name="colunas">Name and width of each column</param>
    /// <returns>One dictionary per row, mapping column name to cell text</returns>
    private static List<Dictionary<string, string>> LerTabelaPDF(string pdf, float origemXPag1, float origemYPag1, int linhasPag1, float origemXOutrasPag, float origemYOutrasPag, int linhasOutrasPag, float alturaLinha, Dictionary<string, float> colunas)
    {
        // First page
        float origemX = origemXPag1;
        float origemY = origemYPag1;
        int quantidadeLinhas = linhasPag1;

        var resultado = new List<Dictionary<string, string>>();
        using (PdfReader leitor = new PdfReader(pdf))
        {
            var texto = string.Empty;
            for (int i = 1; i <= leitor.NumberOfPages; i++)
            {
                if (i > 1)
                {
                    origemX = origemXOutrasPag;
                    origemY = origemYOutrasPag;
                    quantidadeLinhas = linhasOutrasPag;
                }
                for (int l = 0; l < quantidadeLinhas; l++)
                {
                    var dados = new Dictionary<string, string>();
                    int c = 0;
                    float deslocamentoX = 0;
                    foreach (var coluna in colunas)
                    {
                        RectangleJ rect = new RectangleJ(origemX + deslocamentoX, origemY + (l * alturaLinha), coluna.Value, alturaLinha);
                        RenderFilter filter = new RegionTextRenderFilter(rect);
                        ITextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
                        texto = PdfTextExtractor.GetTextFromPage(leitor, i, strategy);

                        dados.Add(coluna.Key, texto);
                        c++;
                        deslocamentoX += coluna.Value;
                    }
                    if (dados != null)
                        resultado.Add(dados);
                }
            }
        }
        return resultado;
    }

Usage:

        var colunas = new Dictionary<string, float>();
        colunas.Add("cod", 20);
        colunas.Add("desc", 300);

        var registros = LerTabelaPDF(pdf, 19, 75, 9, 19, 40, 13, 40, colunas);
        var cod = registros[0]["cod"];
Gustavo Rossi Muller
  • You parse the same page completely many times, each time to extract a different portion of it. This takes much longer than parsing it only once into a regular `LocationTextExtractionStrategy` and then retrieving the contents of those different page regions by calling `LocationTextExtractionStrategy.getResultantText(TextChunkFilter)` with a respectively matching `TextChunkFilter`. In a similar context with iText 7 that switch made the extraction [87.6 times faster](https://stackoverflow.com/questions/48597948?rq=1#comment84346968_48632031)... – mkl Dec 21 '18 at 11:00
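The "parse once, then filter per region" idea from this comment can be illustrated independently of the library. The following is only a sketch under stated assumptions: `TextChunk` and `TableGrid.ToRows` are hypothetical names, the chunks are assumed to have been collected in reading order during a single parsing pass of one page, and the grid uses the same convention as the answer's parameters (row index grows with Y starting at `originY`, columns laid out left to right from `originX`):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a positioned piece of text collected in one parsing pass.
public sealed class TextChunk
{
    public float X { get; }
    public float Y { get; }
    public string Text { get; }
    public TextChunk(float x, float y, string text) { X = x; Y = y; Text = text; }
}

public static class TableGrid
{
    // Buckets already extracted chunks into a fixed grid instead of re-parsing the page per cell.
    public static List<Dictionary<string, string>> ToRows(
        IEnumerable<TextChunk> chunks, float originX, float originY,
        float rowHeight, int rowCount, IList<KeyValuePair<string, float>> columns)
    {
        // Start with empty cells so every row has every column.
        var rows = new List<Dictionary<string, string>>();
        for (int r = 0; r < rowCount; r++)
        {
            var row = new Dictionary<string, string>();
            foreach (var column in columns) row[column.Key] = string.Empty;
            rows.Add(row);
        }

        foreach (var chunk in chunks)
        {
            // Row index from the vertical offset; chunks outside the grid are skipped.
            int rowIndex = (int)Math.Floor((chunk.Y - originY) / rowHeight);
            if (rowIndex < 0 || rowIndex >= rowCount) continue;

            // Walk the columns left to right until the chunk's X falls inside one.
            float left = originX;
            foreach (var column in columns)
            {
                if (chunk.X >= left && chunk.X < left + column.Value)
                {
                    string current = rows[rowIndex][column.Key];
                    rows[rowIndex][column.Key] =
                        current.Length == 0 ? chunk.Text : current + " " + chunk.Text;
                    break;
                }
                left += column.Value;
            }
        }
        return rows;
    }
}
```

With this shape, the cost of parsing is paid once per page and each cell lookup is a cheap coordinate comparison. A list of key/value pairs is used for the columns because the enumeration order of a `Dictionary` is not guaranteed.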