
I searched quite a bit but couldn't find a solution for this kind of problem, so I'm posting a clear question about it. Most answers cover image or text extraction, which is comparatively easier.

I need to extract tables as text (CSV) and graphs as images from PDFs.

Can anyone help me with efficient Python 3.6 code for this?

So far I can extract JPEGs by scanning for startmark = b"\xff\xd8" and endmark = b"\xff\xd9", but not all tables and graphs in a PDF are plain JPEGs, so my code fails badly on them.
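A minimal sketch of that marker scan, for reference. The helper name `find_jpeg_streams` is hypothetical, and the scan is naive: it only finds byte runs between the two markers, so it cannot see tables or charts drawn as vector graphics, and an embedded thumbnail could end a run early.

```python
from typing import List

JPEG_START = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_END = b"\xff\xd9"    # JPEG end-of-image marker


def find_jpeg_streams(data: bytes) -> List[bytes]:
    """Return each byte run from a start marker to the next end marker."""
    found = []
    pos = 0
    while True:
        start = data.find(JPEG_START, pos)
        if start < 0:
            break
        end = data.find(JPEG_END, start + len(JPEG_START))
        if end < 0:
            break
        stop = end + len(JPEG_END)
        found.append(data[start:stop])
        pos = stop  # continue scanning after this image
    return found
```

Typical use would be `find_jpeg_streams(open("annual-report-2017.pdf", "rb").read())`, writing each returned chunk to its own `.jpg` file.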

For example, I want to extract the table on page 11 and the graphs on page 12 of the PDF linked below, as images or in whatever form is feasible. How should I go about it?

https://hartmannazurecdn.azureedge.net/media/2369/annual-report-2017.pdf

Aakash Basu
  • Hi Aakash, Curious to know how you managed to accomplish this. Especially identifying/extracting charts and graphs. – qwertynik Nov 12 '21 at 09:01
  • Hi Aakash, I'm in need of the same code, to extract charts from pdf using python code. Did you find any solution? – codelover Apr 27 '22 at 15:41

2 Answers


For extracting tables, you can use Camelot.

Here is an article about it.

For images, I found this question and answer: "Extract images from PDF without resampling, in python?"
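As a rough sketch of the table part, assuming the `camelot-py` package is installed: the helper names `csv_path` and `tables_to_csv` are made up for illustration, and `flavor="stream"` is only a guess at what suits the report (Camelot's other flavor, `"lattice"`, targets tables with ruled cell borders).

```python
from pathlib import Path
from typing import List


def csv_path(out_dir: str, page: int, index: int) -> Path:
    """Build an output file name such as out/page11_table0.csv."""
    return Path(out_dir) / f"page{page}_table{index}.csv"


def tables_to_csv(pdf_path: str, page: int, out_dir: str = ".") -> List[Path]:
    """Write every table Camelot detects on one page to its own CSV file."""
    import camelot  # third-party; imported here so the helper above works without it

    # Camelot expects the page selection as a string, e.g. "11" or "11,12".
    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor="stream")
    written = []
    for i in range(len(tables)):
        target = csv_path(out_dir, page, i)
        tables[i].to_csv(str(target))  # each table also exposes a DataFrame as .df
        written.append(target)
    return written
```

For the report in the question, this would be called as `tables_to_csv("annual-report-2017.pdf", 11)`. Whether the detected table is clean depends on the page layout, so the CSV usually needs a manual check.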

milonimrod
  • Images are more or less done. But the biggest challenge is that those graphs aren't images; they're an amalgamation of text, bars, lines and axes. I'm really excited to know how people parse them out of high-quality PDFs. – Aakash Basu Apr 29 '19 at 16:44
  • Getting this error: RuntimeError: Please make sure that Ghostscript is installed. Even though I've installed Ghostscript 9.27. Any help? – Aakash Basu May 20 '19 at 10:42

Try using PyMuPDF (https://github.com/pymupdf/PyMuPDF/tree/1.18.3) for graphs that are an amalgamation of text, bars, lines and axes. It has many extra utilities.
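One possible way to use it for vector charts is to render the region covered by the page's drawing commands to a PNG. This is only a sketch under assumptions: it uses the snake_case method names (`get_drawings`, `get_pixmap`, `Pixmap.save`) found in recent PyMuPDF releases, which may differ in the 1.18.3 tree linked above, and the helper names `union_bbox` and `render_drawings` are hypothetical.

```python
from typing import Iterable, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates


def union_bbox(boxes: Iterable[Box]) -> Optional[Box]:
    """Smallest box that contains every input box, or None if there are none."""
    boxes = list(boxes)
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def render_drawings(pdf_path: str, page_number: int, out_png: str,
                    zoom: float = 2.0) -> Box:
    """Render the area covered by vector drawings on one page to a PNG file."""
    import fitz  # PyMuPDF; third-party

    doc = fitz.open(pdf_path)
    page = doc[page_number - 1]  # PyMuPDF pages are 0-based
    # Each drawing is a dict whose "rect" entry bounds one vector path.
    boxes = [tuple(d["rect"]) for d in page.get_drawings()]
    bbox = union_bbox(boxes) or tuple(page.rect)  # fall back to the whole page
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), clip=fitz.Rect(*bbox))
    pix.save(out_png)
    doc.close()
    return bbox
```

Note that the union of all drawings can span most of the page (backgrounds and rules count as drawings too), and axis labels are text rather than drawings, so separating individual charts would need extra grouping of nearby rectangles.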