I would like to extract text, including tables, from a PDF file.

I tried camelot, but it only extracts table data, not the surrounding text.

I also tried PyPDF2, but it can't read Chinese characters.

Here is the sample PDF to read.

Are there any recommended Python packages for text extraction?

Michael M.
Chan
  • [pdfminer.six](https://anaconda.org/conda-forge/pdfminer.six/files) from `conda-forge` is pretty good. Go to the files tab and grab the tarball that matches your system (Windows, Linux, Mac). – C.Nivs Feb 26 '19 at 03:52
  • https://stackoverflow.com/questions/50985619/how-to-read-pdf-files-which-are-in-asian-languages-chinese-japanese-thai-etc – bumblebee Feb 26 '19 at 05:05
  • There is a Python wrapper for PDFNet here: https://github.com/PDFTron/PDFNetWrappers. An online demo is here: https://www.pdftron.com/pdf-tools/pdf-table-extraction/ – Ika Feb 27 '19 at 15:44

1 Answer

By far the simplest way is to extract the text with a single OS shell command using the Poppler PDF utilities (often bundled alongside Python PDF libraries), then post-process that output in your Python script as required.

>pdftotext -layout -f 1 -l 1 -enc UTF-8 sample.pdf
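If you would rather drive that command from Python, a minimal sketch could look like the following. It assumes `pdftotext` from Poppler is installed and on your PATH; the function names `build_pdftotext_cmd` and `extract_text` are only illustrative. The trailing `-` asks `pdftotext` to write to stdout instead of creating `sample.txt`.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=1, last_page=1):
    """Build the same pdftotext call as the shell command above.

    The final "-" tells pdftotext to write the text to stdout.
    """
    return [
        "pdftotext", "-layout",
        "-f", str(first_page), "-l", str(last_page),
        "-enc", "UTF-8",
        str(pdf_path), "-",
    ]


def extract_text(pdf_path, first_page=1, last_page=1):
    """Run pdftotext and return the extracted text as a str."""
    # Assumption: Poppler is installed separately and visible on PATH.
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) was not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, first_page, last_page),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # We asked for UTF-8 with -enc, so decode the bytes the same way.
    return result.stdout.decode("utf-8")


# Example usage (needs sample.pdf in the current directory):
# text = extract_text("sample.pdf")
# print(text)
```

Because `-layout` keeps the column alignment, table rows come back as whitespace-aligned lines that you can split further in Python.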

NOTE: some of the text to the right of the logo is embedded as an image. That part can be extracted separately using pdftoppm -png or pdfimages and then passed to OCR tools for those smaller areas, although OCR output is of inferior quality.
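As a rough sketch of that second step, the snippet below renders only a cropped region of page 1 to PNG with `pdftoppm`, ready to hand to an OCR tool. The crop coordinates, the 300 DPI resolution, the output prefix and the function names are all hypothetical placeholders; you would need to measure the real region in your own file. The OCR step itself is left out, since the answer does not name a specific tool.

```python
import subprocess


def build_pdftoppm_cmd(pdf_path, out_prefix, x, y, width, height, page=1, dpi=300):
    """Build a pdftoppm call that renders one cropped region of a page as PNG.

    x, y, width and height are in pixels at the chosen resolution.
    """
    return [
        "pdftoppm", "-png", "-singlefile",
        "-r", str(dpi),
        "-f", str(page), "-l", str(page),
        "-x", str(x), "-y", str(y),
        "-W", str(width), "-H", str(height),
        str(pdf_path), str(out_prefix),
    ]


def render_region(pdf_path, out_prefix, x, y, width, height, page=1, dpi=300):
    """Run pdftoppm; with -singlefile the result is written to <out_prefix>.png."""
    subprocess.run(
        build_pdftoppm_cmd(pdf_path, out_prefix, x, y, width, height, page, dpi),
        check=True,  # raise CalledProcessError if pdftoppm fails
    )
    return f"{out_prefix}.png"


# Example usage (coordinates are made up; adjust them to the area beside the logo):
# png_path = render_region("sample.pdf", "logo_text", x=900, y=100, width=1200, height=400)
# The PNG can then be passed to whichever OCR tool you prefer.
```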

K J