How to extract charts/tables/graphs from PDF files using Python?

Question

I searched quite a bit but couldn't find a solution to this kind of problem, so I'm posting a clear question about it. Most answers cover image or text extraction, which is comparatively easier.

I need to extract tables as text (CSV) and graphs as images from PDFs.

Can anyone help me with efficient Python 3.6 code to do this?

So far I can extract JPEGs by scanning for startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly on them.
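
For reference, the marker scan described above might look roughly like the sketch below. The function names and the output file naming are assumptions made for illustration, not the asker's actual code. It only finds embedded JPEG streams, so tables and charts drawn as vector graphics are missed, and a JPEG that embeds a thumbnail can be cut short at the thumbnail's end marker.

```python
from pathlib import Path
from typing import List, Tuple

START_MARK = b"\xff\xd8"  # JPEG start-of-image marker
END_MARK = b"\xff\xd9"    # JPEG end-of-image marker


def find_jpeg_spans(data: bytes) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets of candidate JPEG streams in data."""
    spans = []
    pos = 0
    while True:
        start = data.find(START_MARK, pos)
        if start < 0:
            break
        end = data.find(END_MARK, start + len(START_MARK))
        if end < 0:
            break
        end += len(END_MARK)  # include the end marker in the slice
        spans.append((start, end))
        pos = end
    return spans


def extract_jpegs(pdf_path: str, out_dir: str) -> int:
    """Write each candidate JPEG found in the raw PDF bytes to out_dir."""
    data = Path(pdf_path).read_bytes()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    spans = find_jpeg_spans(data)
    for i, (start, end) in enumerate(spans):
        # Hypothetical naming scheme: jpg0.jpg, jpg1.jpg, ...
        (out / "jpg{}.jpg".format(i)).write_bytes(data[start:end])
    return len(spans)
```

Calling `extract_jpegs("annual-report-2017.pdf", "images")` would write whatever JPEG streams the scan finds; the local file name is assumed here.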

For example, I want to extract the table from page 11 and the graphs from page 12 of the PDF linked below, as images or in whatever form is feasible. How should I go about it?

https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf

Hi Aakash, Curious to know how you managed to accomplish this. Especially identifying/extracting charts and graphs. — qwertynik, Nov 12 '21 at 09:01
Hi Aakash, I'm in need of the same code, to extract charts from pdf using python code. Did you find any solution? — codelover, Apr 27 '22 at 15:41

score 1 · Answer 1 · answered Apr 29 '19 at 08:23

For extracting tables you can use Camelot.

Here is an article about it.
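
A minimal sketch of the Camelot route, assuming the report has been downloaded locally and that `camelot-py` is installed. The function names, the output directory, and the file naming scheme are made up for illustration; `camelot.read_pdf`, `Table.page`, `Table.order` and `Table.to_csv` are the library calls being relied on.

```python
from pathlib import Path


def csv_name(page: int, order: int) -> str:
    """Build a file name such as 'page11_table1.csv' (hypothetical naming scheme)."""
    return "page{}_table{}.csv".format(page, order)


def tables_to_csv(pdf_path: str, pages: str = "11", out_dir: str = "tables") -> list:
    """Write every table Camelot detects on the given pages to its own CSV file."""
    import camelot  # third-party; imported lazily so csv_name works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # flavor="lattice" looks for ruling lines between cells;
    # flavor="stream" relies on whitespace between columns instead.
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor="lattice")
    written = []
    for table in tables:
        target = out / csv_name(table.page, table.order)
        table.to_csv(str(target))
        written.append(target)
    return written
```

For the report in the question this would be called as `tables_to_csv("annual-report-2017.pdf", pages="11")`. If nothing is detected, trying `flavor="stream"` is a reasonable next step, since the right flavor depends on how the table is drawn.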

For images, I found this question and answer: "Extract images from PDF without resampling, in python?"

milonimrod

Images are more or less done. The biggest challenge is that those graphs aren't images; they're an amalgamation of text, bars, lines and axes. I'm really curious how people parse them out of high-quality PDFs. – Aakash Basu Apr 29 '19 at 16:44

I'm getting this error even though I've installed Ghostscript 9.27: RuntimeError: Please make sure that Ghostscript is installed. Any help? – Aakash Basu May 20 '19 at 10:42

score 0 · Answer 2 · answered Nov 20 '20 at 07:04

Try using PyMuPDF (https://github.com/pymupdf/PyMuPDF/tree/1.18.3) for graphs that are an amalgamation of text, bars, lines and axes. It has many extra utilities.
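
One way this could look, as a rough sketch rather than a tested recipe: collect the bounding boxes of the vector drawings on a page and render that region to a PNG. It assumes a recent PyMuPDF release with the snake_case method names (`get_drawings`, `get_pixmap`, `Pixmap.save`); older releases used different method names. The function names, the default output file name, and the zoom factor are illustrative assumptions.

```python
from typing import Iterable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in PDF points


def union_box(boxes: Iterable[Box]) -> Optional[Box]:
    """Return the smallest box covering all given boxes, or None if there are none."""
    boxes = list(boxes)
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def render_vector_region(pdf_path: str, page_number: int = 12,
                         out_png: str = "page12_graphics.png",
                         zoom: float = 2.0) -> Optional[str]:
    """Render the area covered by a page's vector drawings to a PNG file."""
    import fitz  # PyMuPDF; imported lazily so union_box works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number - 1]  # PyMuPDF pages are zero-based
        rects = [d["rect"] for d in page.get_drawings()]
        box = union_box((r.x0, r.y0, r.x1, r.y1) for r in rects)
        if box is None:
            return None  # no vector drawings on this page
        # zoom scales the render resolution relative to the 72 dpi default
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(*box))
        pix.save(out_png)
        return out_png
    finally:
        doc.close()
```

This is deliberately crude: it treats all drawings on the page as one region, so several charts end up in a single image, and axis labels that sit outside the drawn shapes may be clipped. Splitting the drawings into separate charts would need some clustering of the rectangles first.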

rameshreddy kv

Check extract-graphics in this repository (https://github.com/pymupdf/PyMuPDF-Utilities). – rameshreddy kv Nov 23 '20 at 12:59
