I have a PDF containing a table in the format below. The column names and the data are separated by a line of dashes ("--------"):
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13
----------------------------------------------------------------------
B ABC1 F1 SSSSSS 1 32WE 161A1 1 A DU23 162.00 85
C ABC2 F2 DDDDDD 1 WE32 161B1 1 B DU20 162.00 86
C ABC3 F3 FFFFFF 1 DF45 161C1 1 C DU20 162.00 87
Current code:
import tabula
df = tabula.read_pdf("example.pdf", pages='all')
Here df is a list of DataFrames, one for each table found in the PDF.
I was able to extract the table content using tabula, but, possibly because of the table format, it ignores the column names and uses the first data row as the header instead. How can I get the actual column names? Also, col3 is empty, and tabula drops this column completely. How can I extract the complete table, with its column names and including the empty columns?
I am not sure whether this would work, but I believe tabula could read the table correctly if the "--------" line were removed from it. However, I do not know how to delete that line from the PDF. I tried PyPDF2: I can extract the text, but I am not able to change the content.
Code:
import PyPDF2

# Open the PDF file in binary mode
pdfFileObj = open('example.pdf', 'rb')
# Create a PDF reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Loop over every page in the PDF
for i in range(pdfReader.getNumPages()):
    # Get the page object
    pageObj = pdfReader.getPage(i)
    # Extract the text from the page
    pageData = pageObj.extractText()
    print(pageData)
    # Modify the PDF here (this is the part I cannot figure out):
    #   remove "-----" from the extracted text
    #   write the modified text back to the PDF
# Close the PDF file object
pdfFileObj.close()
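To illustrate the "remove the dashes" step from the comments above, here is a rough sketch of what I have in mind. It works on the extracted text, not on the PDF itself. The helper name parse_text_table is made up for illustration, and the sketch assumes that the extracted text keeps one table row per line and that no cell value contains a space. It does not solve the empty-column problem: splitting on whitespace loses the column positions, so the rows end up with 12 values against 13 column names.

```python
import re

# Matches a line made up only of dashes (the header/data separator).
SEPARATOR = re.compile(r"^\s*-{3,}\s*$")


def parse_text_table(text):
    """Split extracted page text into (header, rows), skipping dash-only lines.

    Hypothetical helper: assumes the first remaining line is the header
    and that cell values contain no spaces.
    """
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not SEPARATOR.match(ln)]
    if not lines:
        return [], []
    header = lines[0].split()
    rows = [ln.split() for ln in lines[1:]]
    return header, rows


# Sample text copied from the table in the question.
sample = """col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13
----------------------------------------------------------------------
B ABC1 F1 SSSSSS 1 32WE 161A1 1 A DU23 162.00 85
C ABC2 F2 DDDDDD 1 WE32 161B1 1 B DU20 162.00 86
C ABC3 F3 FFFFFF 1 DF45 161C1 1 C DU20 162.00 87
"""

header, rows = parse_text_table(sample)
print(len(header), "column names:", header)
for row in rows:
    print(len(row), "values:", row)
```

The mismatch between the number of column names and the number of values per row is exactly the problem I need to solve, so I suspect I need something that uses the position of each value on the page instead of plain text.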