
I'm new to Python development. Is there a way to convert a scanned PDF or image to Excel using Python?

I tried the following method:

Step 1: Install the pypandoc library with pip install pypandoc, then import it in my code file as shown below:

import pypandoc
from reportlab.lib.testutils import outputfile

Step 2: Add the code below to convert the file to Excel:

canout = pypandoc.convert_file("DT.pdf", 'excel', outputfile="MyPdf.excel")
assert canout==""

But this did not work. Please suggest how I can implement this.

Note: Any other way of implementing it is also welcome.

Thanks

Sharathkumar KG
  • Check [this answer](https://stackoverflow.com/questions/17217194/extracting-table-contents-from-a-collection-of-pdf-files/26110587#26110587), it might be useful. – Ketan Mukadam Oct 17 '17 at 07:04
  • You would have received an error saying that PDF is not supported. You may need PDFTables. – sayth Apr 03 '18 at 08:24

2 Answers


FYI - the CLI version of tabula allows you to specify multiple regions of interest per page. Here, five areas are specified:

java -jar .\tabula-1.0.2-jar-with-dependencies.jar -p 1 -a 175,140,540,270 -a 175,265,540,390 -a 175,390,540,520 -a 175,510,540,640 -a 175,640,540,780 -o outFile.csv testfile.pdf
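If the call is driven from Python, the same command can be assembled from a list of areas. This is only a minimal sketch: the helper name `build_tabula_command` is made up for illustration, and the flags (`-p`, `-a`, `-o`), jar name, and file names are copied from the command above. Actually running it assumes Java and the jar are available locally.

```python
import subprocess  # only needed if the command is actually executed


def build_tabula_command(jar_path, pdf_path, out_path, page, areas):
    """Return the argument list for one tabula CLI call with several -a regions.

    Hypothetical helper: it only builds the argument list, it does not run Java.
    """
    cmd = ["java", "-jar", jar_path, "-p", str(page)]
    for area in areas:
        if len(area) != 4:
            raise ValueError(f"an area needs 4 values, got {area!r}")
        # Each region becomes its own -a flag with a comma-separated value.
        cmd += ["-a", ",".join(str(v) for v in area)]
    cmd += ["-o", out_path, pdf_path]
    return cmd


# The five regions from the command above.
areas = [
    (175, 140, 540, 270),
    (175, 265, 540, 390),
    (175, 390, 540, 520),
    (175, 510, 540, 640),
    (175, 640, 540, 780),
]

cmd = build_tabula_command(
    "tabula-1.0.2-jar-with-dependencies.jar",
    "testfile.pdf",
    "outFile.csv",
    1,
    areas,
)
print(" ".join(cmd))

# Uncomment to run it (requires Java and the jar on disk):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, so the same code should work on Windows and Linux.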

jonquille

The four numbers after each -a define an area of interest on the page. tabula reads them in the order top,left,bottom,right (that is, y1,x1,y2,x2), measured in points from the top-left corner of the page. Imagine laying transparent graph paper on the page and marking the four points for the first area, 175,140,540,270: (x1=140, y1=175), (x2=270, y1=175), (x1=140, y2=540) and (x2=270, y2=540). Next, draw the horizontal and vertical lines that pass through those points. They enclose a bounding box (rectangle), and that is the area of interest to be processed.

             |              |
             |              |
    ----x1,y1------x2,y1-----
             |  code will   |
             |  look here   |
             |              |
    ----x1,y2------x2,y2-----
             |              |
             |              |

Because a rectangle has only two distinct x values and two distinct y values, four numbers are enough to describe the bounding box of each region of interest to the software.
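To make the four-number idea concrete, here is a small sketch that expands one -a value into the four corners of its rectangle. It assumes the values are ordered top,left,bottom,right, as in tabula's documentation for -a; the `Area` class and `parse_area` function are hypothetical names used only for this illustration.

```python
from typing import NamedTuple


class Area(NamedTuple):
    # Assumed order of tabula's -a option: top, left, bottom, right.
    top: float
    left: float
    bottom: float
    right: float

    def corners(self):
        """Return (x, y) corners: top-left, top-right, bottom-left, bottom-right."""
        return [
            (self.left, self.top),
            (self.right, self.top),
            (self.left, self.bottom),
            (self.right, self.bottom),
        ]

    def size(self):
        """Return (width, height) of the rectangle."""
        return (self.right - self.left, self.bottom - self.top)


def parse_area(text):
    """Parse a string such as '175,140,540,270' into an Area."""
    parts = [float(p) for p in text.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated numbers, got {text!r}")
    area = Area(*parts)
    # With a top-left origin, bottom must be below top and right beyond left.
    if area.bottom <= area.top or area.right <= area.left:
        raise ValueError(f"area {text!r} does not describe a rectangle")
    return area


first = parse_area("175,140,540,270")
print(first.corners())
print(first.size())
```

Two x values and two y values are stored, and all four corners are derived from them, which is why four numbers are sufficient.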

jonquille