I want to extract tables from specific pages of the annual reports that companies publish.
I am using the approach from "Searching text in a PDF using Python?" to find the page numbers, and then tabula to export the tables to CSV. However, because one search string is a substring of the other, the shorter string is reported on two pages and I can't get the right table. The two strings are:
P = "STATEMENT OF PROFIT AND LOSS" #This is on Page 73
S = "CONSOLIDATED STATEMENT OF PROFIT AND LOSS" #This is on Page 131
import tabula
import PyPDF2
import re

pdffile = "NCC.pdf"
reader = PyPDF2.PdfFileReader(pdffile)
P = "STATEMENT OF PROFIT AND LOSS"  # This is on Page 73
S = "CONSOLIDATED STATEMENT OF PROFIT AND LOSS"  # This is on Page 131

# Scan every page and remember the page number where each string was found
for i, page in enumerate(reader.pages, start=1):
    text = page.extract_text()
    if re.search(P, text):
        print("P String Found on Page: " + str(i))
        sat = str(i)  # overwritten on every match, so the last match wins
    if re.search(S, text):
        print("S String Found on Page: " + str(i))
        con = str(i)

# Export the table(s) on the page stored in sat to CSV
tabula.convert_into(pdffile, "outputfile.csv", output_format="csv", pages=sat)
The for loop in the code above prints:
P String Found on Page: 73
P String Found on Page: 131
S String Found on Page: 131
The variable created for the P string ends up holding 131 instead of 73, because P also matches inside S on page 131 and overwrites the earlier value. As a result, tabula.convert_into saves the table from page 131 only, whereas I need the table from page 73.
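
To make the goal concrete, here is a minimal sketch of the matching I am after. It is not a confirmed fix. It assumes the extracted text puts exactly one whitespace character between "CONSOLIDATED" and "STATEMENT", and that the first matching page is the one wanted. The helper find_first_pages and the sample_pages list are hypothetical names used only for illustration.

```python
import re

P = "STATEMENT OF PROFIT AND LOSS"
S = "CONSOLIDATED STATEMENT OF PROFIT AND LOSS"

# Match P only when it is NOT immediately preceded by "CONSOLIDATED" plus
# one whitespace character (assumption: the extracted text looks like that).
P_ONLY = re.compile(r"(?<!CONSOLIDATED\s)" + re.escape(P))
S_ONLY = re.compile(re.escape(S))


def find_first_pages(page_texts):
    """Return (sat, con): 1-based page numbers of the first standalone and
    first consolidated heading, or None where a heading is not found."""
    sat = con = None
    for i, text in enumerate(page_texts, start=1):
        text = text or ""  # guard against pages with no extractable text
        if sat is None and P_ONLY.search(text):
            sat = i  # keep the first match instead of overwriting it
        if con is None and S_ONLY.search(text):
            con = i
    return sat, con


# Hypothetical stand-in for [page.extract_text() for page in reader.pages]
sample_pages = [
    "Directors' report",
    "STATEMENT OF PROFIT AND LOSS for the year",
    "CONSOLIDATED STATEMENT OF PROFIT AND LOSS for the year",
]
print(find_first_pages(sample_pages))
```

With the real file, page_texts would be built from reader.pages, and str(sat) could then be passed as the pages argument of tabula.convert_into. If the report also mentions these headings in a table of contents, the first match may not be the statement page itself, so this sketch would need a further check.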