Deals with extracting useful information from PDF content (for example, text or images).
PDF (Portable Document Format) is a binary format for digital documents. This tag is concerned with parsing these documents, that is to say, extracting text, images or other data from them, or converting them to simpler formats (such as plain text).
Because of the complexity of the PDF format (cf. the specification ISO 32000-1), parsers are often incomplete (they can't extract all information from all documents) and can be exposed to security risks when handling untrusted files.
For example, pdf-parser is a command-line program that parses and analyses PDF documents. It can extract raw data from PDF documents, such as compressed images.
Python-related options:
- You may extract tables directly using camelot ("PDF Table Extraction for Humans").
- You may process the PDF directly using tabula.
- You may convert the PDF to text using pdftotext, then parse the text with Python.
- You may use an external tool to convert your PDF file to Excel or CSV, then use the appropriate Python module to open the Excel/CSV file.
- You may also convert the PDF to an image file, then use recent OCR software (which can often reconstruct tables from the picture) to get the data.
- One such combination is pdf2image with pytesseract.
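
As a rough illustration of two of the options above (pdftotext followed by parsing in Python, and pdf2image with pytesseract), here is a minimal sketch. It assumes the `pdftotext` command-line tool (from Poppler) is on the PATH and, for the OCR fallback, that `pdf2image`, `pytesseract` and the Tesseract binary are installed. The function names and the "two or more spaces separate columns" rule are illustrative assumptions, not part of any of these libraries, and the rule only holds for simple tables.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run the pdftotext command-line tool and return its output as a string."""
    # "-layout" asks pdftotext to keep the physical layout of the page;
    # "-" as the output file sends the text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def rows_from_layout_text(text):
    """Split layout-preserved text into rows of cells.

    Assumption (hypothetical heuristic): columns are separated by runs of
    two or more whitespace characters. Blank lines are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append(re.split(r"\s{2,}", line))
    return rows


def ocr_pdf(pdf_path, dpi=300):
    """OCR fallback for scanned PDFs: render pages to images, then OCR them."""
    # Imported lazily so the text-based route works without these packages.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


if __name__ == "__main__":
    import sys

    # Usage: python extract.py some_file.pdf
    for row in rows_from_layout_text(pdf_to_text(sys.argv[1])):
        print(row)
```

The text route is usually worth trying first, since it is fast and exact when the PDF contains real text; the OCR route is only needed for scanned documents, and its output generally needs more cleanup.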