Issue with reading in a table from PDF (using PdfReader), columns read in a random order

Question

I am quite new to Python and was given the task of writing code that reads in a PDF (generated as output by a scientific instrument) and transforms it into a CSV. The first page of the PDF contains the following table: [image: table]

I wrote the code below. It extracts the data, but in a very unusual order: it reads the first row (which has two entries) first, then the columns in the order 1, 3, 4, 5, 6, 7, 8, 2, 9. [image: read-in output]

Does anyone have a suggestion for how I could adjust the code so that it reads the data as a proper table? Or do I need to give the table in the PDF lines around each cell to make it work?

Many thanks in advance! :)

# importing required modules
import PyPDF2
from PyPDF2 import PdfReader
from datetime import datetime
from camelot.core import Table
import tkinter as tk
import camelot
import pandas as pd
import numpy as np

file = 'test2.pdf'

# creating a pdf reader object
reader = PdfReader(file)
 
# printing number of pages in pdf file
#print(len(reader.pages))
 
# getting a specific page from the pdf file
page = reader.pages[0]
 
# extracting text from page
text = page.extract_text()

rows = text.split('\n')
for row in rows:
  print(row)

See also: https://stackoverflow.com/questions/47533875/how-to-extract-table-as-text-from-the-pdf-using-python — Martin Thoma, Mar 28 '23 at 14:56

K J · Answer 1 · 2023-03-28T22:05:20.507

This is a common problem with tabular text in PDFs: the text is stored in the order it was drawn, which need not match the visual reading order.

In the comparison image, the left is the rendered PDF layout, the middle is the raw extraction order, and the right is what we expect (from clipboard output that follows the screen layout).
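To illustrate the difference between raw extraction order and screen layout, here is a minimal sketch that re-orders text fragments by their position on the page. It assumes you already have hypothetical (x, y, text) tuples for each fragment, with y increasing upwards as in PDF coordinates; how those are obtained, the function name and the tolerance value are all assumptions for illustration, not part of any library.

```python
def fragments_to_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into visual rows.

    Rows come out top to bottom, cells left to right. Fragments whose
    y values differ by at most y_tolerance are treated as the same row.
    """
    rows = []  # each entry is (row_y, [(x, text), ...])
    # PDF y grows upwards, so the highest y is the top of the page.
    for x, y, text in sorted(fragments, key=lambda frag: -frag[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells by their x position.
    return [
        [text for _, text in sorted(cells, key=lambda cell: cell[0])]
        for _, cells in rows
    ]
```

This only shows the idea; real pages usually need more care (wrapped cells, varying font sizes), which is why a layout-aware tool is the easier route.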

What we need is closest to this (still not perfect, because of the alignment and the missing ≥ symbol):

                      Organ Dose (mGy/MBq)                        E.2.2. Imaging devices:

Adults                Bladder  0.05               0.0077                    - Multiple detector (triple or dual head) or other
  99mTc-ECDa          Kidney   0.034              0.0093                         dedicated small to medium field of view
  99mTc-HMPAOb                                                                   SPECT cameras for brain imaging should be
                      Bladder  0.11               0.022                          used for acquisition since these devices gen-
Children (5 years)    Thyroid  0.14               0.027                          erally produce results superior to those
  99mTc-ECDa                                                                     obtained with single-head cameras.
  99mTc-HMPAOb
                                                                            - Single-detector units may only be used if
a ICRP 106, page 107                                                             the scan time is prolonged appropriately (so
b ICRP 80, page 100                                                              as to reach at least 5 million total detected

This is slightly better:

                     Organ    Dose (mGy/MBq)                                 dedicated small to medium field of view

Adults                                                                       SPECT cameras for brain imaging should be

   99mTc-ECDa        Bladder  0.05            0.0077                         used for acquisition since these devices gen-

   99mTc-HMPAOb      Kidney   0.034           0.0093                         erally produce results superior to those

Children (5 years)                                                           obtained with single-head cameras.

   99mTc-ECDa        Bladder  0.11            0.022                       -  Single-detector units may only be used if

   99mTc-HMPAOb      Thyroid  0.14            0.027                          the scan time is prolonged appropriately (so

                                                                             as to reach at least 5 million total detected

aICRP 106, page 107                                                          events) and meticulous care is taken to

bICRP 80, page 100                                                           produce high-quality images.

So the best command is pdftotext -layout, which gives us a good working input to start parsing as text in Python.
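As a rough sketch of that parsing step in Python: the first function below shells out to pdftotext with -layout (it assumes pdftotext is installed and on the PATH), and the second splits one layout line into cells wherever two or more spaces occur. Both function names are made up for illustration, and the two-space rule is an assumption that breaks if a cell itself contains a double space or if neighbouring columns are separated by a single space.

```python
import re
import subprocess


def pdf_page_to_layout_text(pdf_path, page=1):
    """Return the text of one page as produced by `pdftotext -layout`.

    Assumes the pdftotext executable is available on the PATH.
    """
    command = [
        "pdftotext", "-layout", "-nopgbrk", "-enc", "UTF-8",
        "-f", str(page), "-l", str(page),
        pdf_path, "-",  # "-" sends the text to stdout
    ]
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")


def split_columns(line):
    """Split one layout line into cells on runs of two or more spaces."""
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
```

Usage would then be along the lines of `rows = [split_columns(l) for l in pdf_page_to_layout_text("test2.pdf").splitlines()]`, dropping the empty lists that come from blank lines.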

On the Windows command line (CLI), converting to CSV takes a second command line or so. For a simple case (from a different source file), where we want to abbreviate the column headers and at the same time exclude the x symbol/column, it could be:

@pdftotext -nopgbrk -f 1 -l 1 -layout -x 290 -y 530 -W 300 -H 300 cut-sample.pdf out.txt
@echo Pc,W,H,Q,C>out.csv&for /f "usebackq tokens=1,2,4,5,6 delims= " %%f in ("out.txt") do @echo %%f,%%g,%%h,%%i,%%j >>out.csv
@echo/&type out.csv

However, this type of layout needs a different approach, depending on how much of it is wanted:

@echo off
echo "Range (µm)",2.00,≥ 10.00,≥ 25.00,≥ 50.00,≥ 100.00 >out.csv
echo " ",^< 10.00,^< 1751.00,^< 1751.00,^< 1751.00,^< 1751.00 >>out.csv
for /f "usebackq tokens=1,2,3,4,5,6,7 delims= " %%A in ("particles.txt") do (
if "%%A"=="Count" echo %%A,%%B,%%C,%%D,%%E,%%F >>out.csv
if "%%A"=="-" echo " ",%%A,%%B,%%C,%%D,%%E >>out.csv
if "%%A"=="Concentration" echo %%A %%B,%%C,%%D,%%E,%%F,%%G >>out.csv
)
echo/&"C:\Apps\Microsoft Office\Office\Excel" out.csv
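For readers who prefer to stay in Python, the row selection in the batch script above could be sketched roughly as follows. It mirrors the three `if` tests (rows starting with Count, - or Concentration) on whitespace-separated tokens; the function name is made up, and the contents of particles.txt are not shown in this answer, so the exact token counts are an assumption taken from the batch script.

```python
def select_rows(lines):
    """Pick the Count, "-" and Concentration rows from layout text lines."""
    rows = []
    for line in lines:
        tokens = line.split()  # split on any run of whitespace
        if not tokens:
            continue
        if tokens[0] == "Count":
            rows.append(tokens[:6])
        elif tokens[0] == "-":
            # Continuation row: pad the first column with a blank cell.
            rows.append([" "] + tokens[:5])
        elif tokens[0] == "Concentration":
            # The label spans two tokens, so join them into one cell.
            rows.append([" ".join(tokens[:2])] + tokens[2:7])
    return rows
```

Unlike the batch version, a short line simply yields a shorter row here instead of empty trailing fields.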

However, note that CSV as 7/8-bit text and UTF-8 symbols do not work well together without a pure 8-bit format. The CSV above is plain 8-bit WinAnsi, and Notepad can save it as UTF-8, but older Excel expects ASCII (the lower 7 bits of ANSI).
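If the CSV is written from Python instead, one possible way around the symbol problem is to write UTF-8 with a byte order mark, which Python offers as the "utf-8-sig" encoding. This is a sketch under the assumption that the Excel version in use recognises the BOM; very old versions may still not. The function name is made up for illustration.

```python
import csv


def write_csv_for_excel(rows, path):
    """Write rows to a CSV file as UTF-8 with a leading BOM."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        csv.writer(handle).writerows(rows)
```

The header cells with ≥ and µ can then be passed as ordinary Python strings, for example `write_csv_for_excel([["Range (µm)", "≥ 10.00"]], "out.csv")`.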
