Find the page number containing a specific string using PyPDF2 and re

Question

I want to extract tables from specific pages of an annual report published by companies.

I am using the approach from "Searching text in a PDF using Python?" to find the page numbers, and tabula to extract the tables to CSV. But because one search string is part of the other, the same page is reported for both strings and I can't get the right table.

P = "STATEMENT OF PROFIT AND LOSS" #This is on Page 73

S = "CONSOLIDATED STATEMENT OF PROFIT AND LOSS" #This is on Page 131

import tabula
import PyPDF2
import re

pdffile="NCC.pdf"

reader = PyPDF2.PdfFileReader(pdffile)

P = "STATEMENT OF PROFIT AND LOSS"   #This is on Page 73
S = "CONSOLIDATED STATEMENT OF PROFIT AND LOSS"   #This is on Page 131

for i, page in enumerate(reader.pages, start=1):
    text = page.extract_text()
    if re.search(P, text):
        print("P String Found on Page: " + str(i))
        sat = str(i)
    if re.search(S, text):
        print("S String Found on Page: " + str(i))
        con = str(i)

tabula.convert_into(pdffile,"outputfile.csv",output_format="csv",pages=sat)

The output of the for loop in the code above is:

P String Found on Page: 73
P String Found on Page: 131
S String Found on Page: 131

The variable created for the P string ends up holding 131 instead of 73, so tabula.convert_into saves the table from page 131 only, whereas I need the table from page 73.
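The double match happens because re.search(P, text) looks for P anywhere in the page text, and P is a substring of S. One possible way around it, sketched below, is a negative lookbehind that rejects the phrase when it is directly preceded by "CONSOLIDATED ", combined with stopping at the first matching page. The helper name first_page_matching and the sample_pages list are hypothetical; the list stands in for the text returned by page.extract_text() on each page. The sketch assumes the extracted text has exactly one space between "CONSOLIDATED" and "STATEMENT", which may not hold for every PDF.

```python
import re

# Match the phrase only when it is NOT directly preceded by "CONSOLIDATED ".
# A lookbehind must have a fixed width, so it checks that exact literal.
STANDALONE = re.compile(r"(?<!CONSOLIDATED )STATEMENT OF PROFIT AND LOSS")
CONSOLIDATED = re.compile(r"CONSOLIDATED STATEMENT OF PROFIT AND LOSS")


def first_page_matching(pattern, page_texts):
    """Return the 1-based number of the first page whose text matches, else None."""
    for number, text in enumerate(page_texts, start=1):
        # extract_text() can return an empty string for image-only pages.
        if text and pattern.search(text):
            return number
    return None


# Hypothetical stand-in for [page.extract_text() for page in reader.pages].
sample_pages = [
    "Directors' report",
    "STATEMENT OF PROFIT AND LOSS for the year",
    "CONSOLIDATED STATEMENT OF PROFIT AND LOSS for the year",
]

sat = first_page_matching(STANDALONE, sample_pages)
con = first_page_matching(CONSOLIDATED, sample_pages)
print(sat, con)
```

With a real report, the value in sat could then be passed as the pages argument of tabula.convert_into, as in the code above. If no page matches, the helper returns None, so that case needs checking before the tabula call.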

Try to break your problem down. The pdf handling part seems to work as expected, right? How could you change the code in the question to remove the pdf part? — Martin Thoma, Jun 30 '22 at 22:00
Sorry, I didn't explain it correctly; I have edited it a little. I do get the page numbers, but I want 73, not 131. With `re.search("^statement")` I thought I would only get a match for a sentence starting with "statement", but that doesn't work either, or maybe I am using the regex wrong. — Shashi, Jul 01 '22 at 05:11
Thanks KJ, I will try that. Also, Martin, the P that I want is on page 73, but as P is also part of S, the result shows as if it's in S as well. But thanks, I think I have got the hang of it from KJ and will try that. — Shashi, Jul 04 '22 at 06:25
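On the `^statement` attempt mentioned in the comments: by default `^` matches only at the very start of the whole string, and the match is case-sensitive, so a lowercase pattern will not find an upper-case heading in the middle of a page. A minimal sketch of an anchored search using the re.MULTILINE and re.IGNORECASE flags follows. The function name pages_with_heading and the sample_pages list are hypothetical, and the sketch assumes the heading begins its own line in the extracted text, which depends on how the PDF was laid out.

```python
import re

# With re.MULTILINE, ^ matches at the start of every line, not just the string.
# re.IGNORECASE lets the lowercase pattern match the upper-case heading.
HEADING = re.compile(
    r"^[ \t]*statement of profit and loss\b",
    re.MULTILINE | re.IGNORECASE,
)


def pages_with_heading(page_texts):
    """Return 1-based numbers of pages with a line that starts with the heading."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if text and HEADING.search(text)
    ]


# Hypothetical stand-in for [page.extract_text() for page in reader.pages].
sample_pages = [
    "Notes\nSee the statement elsewhere",
    "Annual report\nSTATEMENT OF PROFIT AND LOSS\nfor the year",
    "Annual report\nCONSOLIDATED STATEMENT OF PROFIT AND LOSS\nfor the year",
]

print(pages_with_heading(sample_pages))
```

Because the consolidated heading starts its line with "CONSOLIDATED", the anchored pattern skips that page without needing a lookbehind.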
