Python - Can I convert values from a pdf to a DataFrame?

Question

I am trying to convert the values from a PDF into a pandas DataFrame that can be manipulated in Python.

I have attached a photo that shows how I currently do it, as well as a sample PDF. Thanks in advance

I tried a solution from someone who wanted something similar, but it did not work for me: the data I want to return as a DataFrame is at the bottom of the page and is not formatted as a table.

Asking for generic help in this way is outside the scope of Stackoverflow.com and leads to biased answers like you have received. While both answers may be ok, they both also show this issue. There are 100 other ways and neither answer provided is correct. — Kevin Brown, Mar 28 '23 at 06:13

score 0 · Answer 1 · answered Mar 27 '23 at 14:18


I couldn't comment on your question because I don't have enough reputation, but you can definitely check out the tabula-py project to tabulate your data. Here is a link for installation and documentation.

Since your tables are formatted quite neatly, the functions should be able to recognize the data without too much trouble. I'd be happy to try and look through any code you're having problems with as you try to tabulate the data.


HarunCelikOtto


I tried to use tabula-py, but I couldn't get the coordinates right (the values I put there were taken from the google element analysis tool): `import tabula; area = [1378, 740, 2080, 1220]; tabula.convert_into("/content/sample.pdf", "output.csv", output_format="csv", pages='all', area=area); from google.colab import files; files.download('output.csv')` – Crazy Apple Mar 27 '23 at 16:46
I don't know which OS you're on but if you are on Windows, I have used [Sumatra PDF reader](https://www.sumatrapdfreader.org/download-free-pdf-viewer) in the past with good success. If you open the pdf and hit "m" on your keyboard you should be able to see a mouse cursor position in the pdf which you can use. – HarunCelikOtto Mar 27 '23 at 18:28
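
One possible reason the coordinates above did not work: tabula-py takes `area` as (top, left, bottom, right) in PDF points (1/72 inch), while values read from an on-screen inspector are usually pixels. The sketch below only illustrates that idea. The 300 DPI figure is an assumption about how the page was rendered, the helper names `pixels_to_points` and `read_region` are hypothetical, and `read_region` needs tabula-py and Java installed.

```python
def pixels_to_points(box_px, dpi):
    """Convert a (top, left, bottom, right) box from pixels to PDF points.

    One PDF point is 1/72 inch, so points = pixels * 72 / dpi.
    """
    return [round(v * 72 / dpi, 2) for v in box_px]


def read_region(pdf_path, area_pt, page=1):
    """Read one region of one page into a DataFrame (or None if nothing is found)."""
    import tabula  # imported lazily: requires tabula-py and a Java runtime

    tables = tabula.read_pdf(
        pdf_path,
        pages=page,
        area=area_pt,   # top, left, bottom, right in points
        guess=False,    # use the given area instead of auto-detecting tables
        stream=True,    # whitespace-separated columns, no ruling lines needed
    )
    return tables[0] if tables else None


# Assumption: the inspector values came from a page rendered at 300 DPI.
area_pt = pixels_to_points([1378, 740, 2080, 1220], dpi=300)
print(area_pt)
```

With a real file, `read_region("sample.pdf", area_pt)` would return the first table found in that box. If the rendering DPI is unknown, reading the cursor position from a PDF viewer, as suggested above, avoids the conversion entirely.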

K J · Accepted Answer · 2023-03-29T14:54:52.143

The best approach is to pre-process the data before manipulating it. Here I can simply convert the PDF with pdftotext and then open the result in Notepad or Excel; with Excel VBA, that could all be done without Python. Alternatively, for your use, you can edit the text into CSV with Python by adding commas at the desired columns, the way Excel does it.

Either way, it is just one line to call on multiple files.

list,of al,l pieces:,,,
,Piece,Widt x,Hei,Q,ty Description
,58,762 x,582,2,@5
,70,762 x,582,2,@5
,16,70 x,564,4,@8
,67,70 x,1250,4,@8
,59,1250 x,582,1,@5
,71,1350 x,582,1,@5
,77,762 x,582,1,@5
,28,744 x,70,1,@8
,44,194 x,70,1,@8
,84,802 x,280,3,@2

So, depending on how you clean your text, you can do much better than the raw single-line output above, and we don't need Excel either:

@pdftotext -nopgbrk -f 1 -l 1 -layout -x 290 -y 530 -W 300 -H 300 cut-sample.pdf out.txt
@echo Pc,W,H,Q,C>out.csv&for /f "usebackq tokens=1,2,4,5,6 delims= " %%f in ("out.txt") do @echo %%f,%%g,%%h,%%i,%%j >>out.csv
@echo/&type out.csv

Here I have not allowed for tables of different sizes or positions. If necessary, you can move that "window" of interest up and to the left, make it wider and taller, and then simply extract any line that includes @, since those are always present in the OP's example.
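
Since the question asks for a DataFrame, the same idea can be sketched in Python: run pdftotext with the options from the batch file, keep only lines containing @, and drop the "x" token. This is a minimal sketch, not a tested solution for the OP's file. The function names are hypothetical, the sample text is an assumed approximation of the `-layout` output, and pandas is imported only inside `to_dataframe`.

```python
import subprocess

COLUMNS = ["Pc", "W", "H", "Q", "C"]  # same header as the batch file writes


def extract_text(pdf_path):
    """Run pdftotext on the cropped window of page 1 and return its text."""
    cmd = ["pdftotext", "-nopgbrk", "-f", "1", "-l", "1", "-layout",
           "-x", "290", "-y", "530", "-W", "300", "-H", "300",
           pdf_path, "-"]  # "-" sends the text to stdout
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def parse_rows(text):
    """Keep lines that contain '@' and look like 'piece width x height qty code'."""
    rows = []
    for line in text.splitlines():
        if "@" not in line:
            continue
        tokens = line.split()
        if len(tokens) < 6 or tokens[2] != "x":
            continue  # not a data row in the assumed layout
        piece, width, _x, height, qty, code = tokens[:6]
        rows.append({"Pc": int(piece), "W": int(width), "H": int(height),
                     "Q": int(qty), "C": code})
    return rows


def to_dataframe(rows):
    """Build the DataFrame; pandas is imported lazily so parsing works without it."""
    import pandas as pd
    return pd.DataFrame(rows, columns=COLUMNS)


# Assumed sample of the -layout text, based on the CSV shown earlier.
sample = """\
list of all pieces:
 Piece   Width x Height  Qty Description
    58     762 x    582    2 @5
    16      70 x    564    4 @8
"""
rows = parse_rows(sample)
print(rows)
```

With pdftotext on the PATH, `to_dataframe(parse_rows(extract_text("cut-sample.pdf")))` would give the DataFrame directly, with no intermediate CSV file.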

For a more complex "if this then that" CSV output see https://stackoverflow.com/a/75856112/10802527

Seriously, I found the tool very useful; I really didn't know it existed. I'm using the space as the delimiter, and I drop the column that contains the "x" after the Widt – Crazy Apple, Mar 27 '23 at 16:51
