0

I have a PDF file stored at a server URL, and I want to read each line of the file. Later I want to export it to an Excel file, so I need to get every line, one by one. My code is below. Note: the PDF URL stops working after 3 hours; I will keep updating it here in the comments. Thanks.

using System;
using System.Net.Http;
using System.Threading.Tasks;

public class Program
{
    public static async Task Main()
    {
        var pdfUrl = "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=4baa2515293382eb41b2a95e121550490b5b154f1c4c06e8b0469eff082311e6&hash=3112f8451af24a1a5c3e69afab09f079&termosPesquisados=";
        var client = new HttpClient();
        var response = await client.GetAsync(pdfUrl);

        using (var stream = await response.Content.ReadAsStreamAsync())
        {
            Console.WriteLine("print each line of my pdf file");
        }
    }
}
Gouveia
  • I am using the online .NET Fiddle to test this code: https://dotnetfiddle.net/ – Gouveia Jun 27 '22 at 17:12
  • 2
    That's a big stretch to assume PDF files have a concept of what a "line" is. In reality they often don't work that way at all. – Joel Coehoorn Jun 27 '22 at 17:13
  • You'll need to either make your own PDF parser and learn the structure of PDFs, or find a library that is able to help you here. PDFSharp: http://www.pdfsharp.net/wiki/PDFsharpSamples.ashx – Ibrennan208 Jun 27 '22 at 17:16
  • 1
    PDF files aren't plain-text files. You'll need to parse it apart *somehow* and determine what your "lines" are. It's going to require quite a bit more code than simply writing out what you're reading in. – Broots Waymb Jun 27 '22 at 17:18
  • OK, thanks, but how do I read the content from the PDF? The function I am using is showing me byte values, not a string or text. – Gouveia Jun 27 '22 at 17:33
  • The HTTP response comes back as a MemoryStream. If you have a way to convert it to a string so I can view the data, that would already be a step forward. – Gouveia Jun 27 '22 at 18:55
  • 2
    "`The function I am using is showing me byte values`". Well, yeah. PDF files **ARE** bit/byte values. You can see this by opening the file in a pure text editor like Notepad. The text is not present as such in the file. Instead you have a bunch of binary formatting data for page settings, font data, text area definitions, embedded images, etc., that occasionally includes a snippet of character data here and there. Even when the text is included, it won't necessarily show when you view the raw file data, because content streams are usually compressed and fonts can use custom encodings. – Joel Coehoorn Jun 27 '22 at 19:21
  • 2
    TLDR; pdf files are a list of instructions to draw a page. Which is probably full of rectangles containing text. At best you can compare these rectangles to work out if they are close enough to be considered the same line. And lines that are close enough to look like paragraphs. Though there are tools which try to do this, none will be perfect for every possible pdf document. – Jeremy Lakeman Jun 28 '22 at 00:59

2 Answers

0

Extracting text from a PDF is not a trivial task. If you need a truly generic solution that works with any PDF, the state-of-the-art approach is to use an AI-based API, such as those offered by cloud platforms like Google, AWS or Azure:

https://cloud.google.com/vision/docs/pdf

https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/

https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/automatically-extract-content-from-pdf-files-using-amazon-textract.html

So the flow is: read the PDF as bytes, send the bytes to the external AI-based API, and receive the parsed content back.

Of course, you will need to do some setup before you can use the cloud services mentioned above, and they also cost money.
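
The first step, reading the PDF as bytes, can be sketched as follows. This is only a minimal illustration: the `PdfDownload` class and its method names are made up for this example, and the call to the cloud service is left out because each provider's SDK differs. The signature check is a cheap sanity test, since a server may answer with something other than a PDF (for instance an error page once a temporary link has expired).

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper, not part of any library.
public static class PdfDownload
{
    // True when the buffer starts with the "%PDF-" signature that opens a PDF file.
    public static bool LooksLikePdf(byte[] data)
    {
        byte[] magic = Encoding.ASCII.GetBytes("%PDF-");
        if (data == null || data.Length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
        {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }

    // Downloads the document as raw bytes; nothing is decoded as text here.
    public static async Task<byte[]> DownloadAsync(HttpClient client, string url)
    {
        using var response = await client.GetAsync(url);
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsByteArrayAsync();
    }
}
```

A caller would do something like `var bytes = await PdfDownload.DownloadAsync(client, pdfUrl);`, verify `PdfDownload.LooksLikePdf(bytes)`, and only then hand the bytes to whichever extraction service is chosen.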

Alexey Gorbel
0

The best way I can explain why you need a PDF text extractor such as pdftotext is this: the first line, once decoded by an application (this is not the raw byte stream), comes in three separate parts. Luckily they are stored as whole-word strings (they need not be) and, also luckily in this case, they all use the same ASCII font table.

BT /F1 12.00 Tf ET
BT 42.52 793.70 Td (Espelho de Valores Atualizados.) Tj ET
BT /F1 12.00 Tf ET
BT 439.37 793.70 Td (Data: ) Tj ET
BT 481.89 793.70 Td (05/07/2021) Tj ET 

Once converted to ASCII, we can easily see that all three parts sit at the same vertical position, 793.70, so a library can assume they form one line with three different horizontal offsets. That is why you need a third-party library to decode the fragments and reassemble them as if they were a single line of text. The process is: first save the PDF as a file, then decode the whole file, whose strings may mix several common encodings such as ASCII, hex and UTF-16 (there is generally no UTF-8), then save the result as a plain-text file with UTF-8 encoding. You can then extract the UTF-8 lines as required.
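
To make the "same level means same line" idea concrete, here is a minimal C# sketch that groups already-decoded fragments by their y value and orders them by x. It is an illustration only, not how pdftotext works internally: the `LineAssembler` class is hypothetical, it understands nothing but the simple `BT x y Td (text) Tj ET` form shown above, and it assumes each `Td` gives an absolute page position. Real PDFs use many more operators, escaped parentheses, compressed streams and font-specific encodings, and a real extractor would compare y values with a tolerance instead of exact equality.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not a PDF parser.
public static class LineAssembler
{
    // Matches only the simple "BT x y Td (text) Tj ET" form from the decoded sample.
    private static readonly Regex Fragment = new Regex(
        @"BT\s+(?<x>-?\d+(?:\.\d+)?)\s+(?<y>-?\d+(?:\.\d+)?)\s+Td\s+\((?<text>[^)]*)\)\s+Tj\s+ET");

    public static List<string> Assemble(string decodedContent)
    {
        var fragments = new List<(double X, double Y, string Text)>();
        foreach (Match m in Fragment.Matches(decodedContent))
        {
            fragments.Add((
                double.Parse(m.Groups["x"].Value, CultureInfo.InvariantCulture),
                double.Parse(m.Groups["y"].Value, CultureInfo.InvariantCulture),
                m.Groups["text"].Value));
        }

        return fragments
            .GroupBy(f => f.Y)                 // same y value: treat as the same line
            .OrderByDescending(g => g.Key)     // PDF y grows upward, so larger y is higher on the page
            .Select(g => string.Join(" ", g.OrderBy(f => f.X).Select(f => f.Text.Trim())))
            .ToList();
    }
}
```

Fed the five decoded lines above, `LineAssembler.Assemble` yields a single entry in which the three fragments are joined left to right. The font-selection lines (`BT /F1 12.00 Tf ET`) are simply skipped because they do not match the pattern.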

It is unclear what line format you are hoping for, since a PDF does not have numbered lines. However, if we assign numbers to the lines with text (and some without) based on the human concept of layout, we can do it in a few lines using the Poppler utilities and native OS text tools. Here Cme.bat could use loops and arguments, but it is hardcoded for demonstration. Note that the console output needs the right local code page (chcp) to display correctly, but the text file itself is good.

Poppler\poppler-22.04.0\Library\bin>Cme.bat |more

@curl -o brtemp.pdf "https://eproc.trf4.jus.br/eproc2trf4/controlador.php?acao=acessar_documento_implementacao&doc=41625504719486351366932807019&evento=20084&key=c6c5f83e942a3ee021a874f6287505c1cb484235935ff1305c6081893e3481b1&hash=922cacb9024f200d13d3f819e2e906f4&termosPesquisados="
@pdftotext -f 1 -l 1 -nopgbrk -layout -enc UTF-8 brtemp.pdf page1.txt
@pdftotext -f 2 -l 2 -nopgbrk -layout -enc UTF-8 brtemp.pdf page2.txt
@find /N /V "Never2BFound" page1.txt
@find /N /V "Never2BFound" page2.txt

responds

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3749  100  3749    0     0   4051      0 --:--:-- --:--:-- --:--:--  4052

---------- PAGE1.TXT
[1]Espelho de Valores Atualizados.                                    Data:   05/07/2021
[2]


Page 1.txt

Espelho de Valores Atualizados.                                    Data:   05/07/2021

PROCESSO         : 5018290-57.2021.4.04.9388
ORIGINÁRIO       : 5002262-05.2018.4.04.7000/PR
TIPO             : Precatório

REQUERENTE       : ERCILIA GRACIE RIBEIRO
ADVOGADO         : ANA PAULA HORIGUCHI - PR064269

REQUERIDO  : INSTITUTO NACIONAL DO SEGURO SOCIAL - INSS
PROCURADOR : PROCURADORIA REGIONAL FEDERAL DA 4 REGIÃO - PRF4

DEPRECANTE       : Juízo Substituto da 10ª VF de Curitiba

etc.....
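
If the lines are needed inside a .NET console application, the same pdftotext call can be started from C# without going through cmd. The following is a hedged sketch, not a tested integration: `PdfToTextRunner`, `exePath` and `pdfPath` are hypothetical names, and it assumes the Poppler binaries are available on the machine running the program and that the PDF has already been saved to disk. The flags are the same ones used in the batch file, with `-` as the output file so the text comes back on standard output.

```csharp
using System;
using System.Diagnostics;
using System.Text;

// Hypothetical wrapper around the pdftotext command-line tool.
public static class PdfToTextRunner
{
    // Splits extracted text into lines, accepting Windows and Unix line endings.
    // Trailing empty lines are dropped; empty lines inside the text are kept.
    public static string[] SplitLines(string text) =>
        text.Replace("\r\n", "\n").TrimEnd('\n').Split('\n');

    // Runs pdftotext on a single page and returns that page's lines.
    public static string[] ExtractPage(string exePath, string pdfPath, int page)
    {
        var psi = new ProcessStartInfo
        {
            FileName = exePath,
            // "-" as the output file asks pdftotext to write to standard output.
            Arguments = $"-f {page} -l {page} -nopgbrk -layout -enc UTF-8 \"{pdfPath}\" -",
            RedirectStandardOutput = true,
            StandardOutputEncoding = Encoding.UTF8,
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using var process = Process.Start(psi);
        string output = process.StandardOutput.ReadToEnd();
        process.WaitForExit();
        if (process.ExitCode != 0)
        {
            throw new InvalidOperationException($"pdftotext exited with code {process.ExitCode}");
        }
        return SplitLines(output);
    }
}
```

Each returned string corresponds to one numbered line of the `find /N` output above, so `ExtractPage(exePath, "brtemp.pdf", 1)` could feed rows straight into whatever writes the Excel file.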

K J
  • Nice work, but I need to load this data inside a .NET console application and I cannot use cmd. Your example is awesome, though. – Gouveia Jun 28 '22 at 00:37