Extract complete table from PDF using tabula in python

Question

I have a PDF containing a table in the format below. The column names and the data are separated by a line of dashes ("--------"):

col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13
----------------------------------------------------------------------
B    ABC1      F1  SSSSSS 1   32WE 161A1 1     A   DU23   162.00 85
C    ABC2      F2  DDDDDD 1   WE32 161B1 1     B   DU20   162.00 86
C    ABC3      F3  FFFFFF 1   DF45 161C1 1     C   DU20   162.00 87

Current code:

import tabula
df = tabula.read_pdf("example.pdf", pages='all')

Here df is a list of DataFrames, one for each table found in the PDF.

I was able to extract the table content using tabula, but, possibly because of the table format, it ignores the column names and uses the first data row as the column names instead. How can I get the real column names? Also, col3 is empty, and tabula drops this column completely. How can I extract the complete table, with the column names and including empty columns?

I am not sure whether this would work, but I believe tabula could read the table correctly if the "-----------" line were removed. However, I do not know how to delete the "------------" line from the PDF. I am trying to extract the data with PyPDF2, but I have not been able to change the content.

Code:

import PyPDF2

# note: PdfFileReader, getNumPages, getPage and extractText are the
# legacy PyPDF2 names (from releases before 3.0)

# opening the pdf file in binary mode; the with block closes it afterwards
with open('example.pdf', 'rb') as pdfFileObj:
    # creating a pdf reader object
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # looping over every page in the pdf file
    for i in range(pdfReader.getNumPages()):
        # creating a page object
        pageObj = pdfReader.getPage(i)
        # extracting text from page
        pageData = pageObj.extractText()
        print(pageData)

        # Modify PDF here
        # remove "-----" from the extracted text
        # rewrite modified text back to pdf
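
As an illustration of the "remove the dashes" idea, the dashed rule could be dropped from the extracted text itself rather than from the PDF. The sketch below is not a tabula or PyPDF2 feature. It assumes the extracted text keeps one table row per line, as in the sample above, and that the caller already knows which columns are blank in every row (col3 here). The names parse_dashed_table, empty_columns and SAMPLE are hypothetical, and SAMPLE only stands in for real extractText() output, which may be spaced differently.

```python
import re

# Hypothetical stand-in for text returned by extractText();
# a real PDF may yield different spacing or line breaks.
SAMPLE = """\
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13
----------------------------------------------------------------------
B    ABC1      F1  SSSSSS 1   32WE 161A1 1     A   DU23   162.00 85
C    ABC2      F2  DDDDDD 1   WE32 161B1 1     B   DU20   162.00 86
"""


def parse_dashed_table(text, empty_columns=()):
    """Parse a whitespace-separated table whose header sits above a dashed rule.

    empty_columns names the header entries assumed to be blank in every
    data row; whitespace splitting alone cannot detect a blank column.
    Returns (header, rows), where each row is a dict keyed by column name.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Find the dashed separator; the line just before it is the header.
    sep = next(
        (i for i, ln in enumerate(lines) if re.fullmatch(r"-{3,}", ln.strip())),
        None,
    )
    if sep is None or sep == 0:
        raise ValueError("no header line followed by a dashed rule was found")

    header = lines[sep - 1].split()
    present = [name for name in header if name not in empty_columns]

    rows = []
    for ln in lines[sep + 1:]:
        values = ln.split()
        if len(values) != len(present):
            raise ValueError(f"unexpected number of fields in row: {ln!r}")
        # Start with every column blank, then fill in the ones that have data.
        row = dict.fromkeys(header, "")
        row.update(zip(present, values))
        rows.append(row)
    return header, rows


header, rows = parse_dashed_table(SAMPLE, empty_columns=("col3",))
print(header)
print(rows[0])
```

If pandas is available, the resulting list of dicts could be passed to pandas.DataFrame(rows, columns=header) to get a table with all 13 columns. Rows in which a column is only sometimes blank would need a position-based approach instead, since this sketch raises an error for them.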

Try SLICEmyPDF in 1 of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 — 123456, May 29 '22 at 10:22
It looks like the question here is how to remove a line from a pdf. Maybe you can focus on that part? — Martin Thoma, Jul 30 '22 at 10:53
