I want to extract tables from specific pages of an annual report published by companies.

I am using the approach from "Searching text in a PDF using Python?" to find the page numbers, and then tabula to extract the tables to CSV. However, because one search string is part of the other, the shorter string is found on two pages and I can't get the right table.

P = "STATEMENT OF PROFIT AND LOSS" #This is on Page 73

S = "CONSOLIDATED STATEMENT OF PROFIT AND LOSS" #This is on Page 131

import tabula
import PyPDF2
import re

pdffile="NCC.pdf"

reader = PyPDF2.PdfReader(pdffile)  # PdfFileReader is the deprecated name for this class

P = "STATEMENT OF PROFIT AND LOSS"   #This is on Page 73
S = "CONSOLIDATED STATEMENT OF PROFIT AND LOSS"   #This is on Page 131

for i, page in enumerate(reader.pages, start=1):
    text = page.extract_text()
    if re.search(P, text):
        print("P String Found on Page: " + str(i))
        sat = str(i)  # reassigned on every match, so the last matching page wins
    if re.search(S, text):
        print("S String Found on Page: " + str(i))
        con = str(i)

tabula.convert_into(pdffile,"outputfile.csv",output_format="csv",pages=sat)

The output of the for loop in the code above is:

P String Found on Page: 73
P String Found on Page: 131
S String Found on Page: 131

The variable created for the P string ends up holding 131 instead of 73, so tabula.convert_into saves the table from page 131 only, whereas I need the table from page 73.
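One possible way around the overlap, as a minimal sketch rather than a tested solution for this PDF: search for P with a negative lookbehind so it is skipped when preceded by "CONSOLIDATED ", and stop at the first matching page. The helper `first_page_matching` and the `sample_pages` list are hypothetical and stand in for the text returned by `page.extract_text()`. The sketch assumes the extracted text has exactly one space between "CONSOLIDATED" and "STATEMENT", which may not hold for every PDF.

```python
import re

# Negative lookbehind: match the phrase only when it is NOT directly
# preceded by "CONSOLIDATED " (assumes a single space in the extracted text).
P_ONLY = re.compile(r"(?<!CONSOLIDATED )STATEMENT OF PROFIT AND LOSS")
S_ONLY = re.compile(r"CONSOLIDATED STATEMENT OF PROFIT AND LOSS")


def first_page_matching(pattern, page_texts):
    """Return the 1-based number of the first page whose text matches, or None."""
    for number, text in enumerate(page_texts, start=1):
        if text and pattern.search(text):
            return number
    return None


# Hypothetical stand-in for [page.extract_text() for page in reader.pages].
sample_pages = [
    "DIRECTORS' REPORT",
    "STATEMENT OF PROFIT AND LOSS for the year",
    "CONSOLIDATED STATEMENT OF PROFIT AND LOSS for the year",
]

sat = first_page_matching(P_ONLY, sample_pages)
con = first_page_matching(S_ONLY, sample_pages)
print(sat, con)
```

The resulting page number could then be passed to tabula as `pages=str(sat)`. If the report also mentions the heading earlier, for example in a table of contents, the first match may still be the wrong page, so the result is worth checking by hand.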

  • Try to break your problem down. The pdf handling part seems to work as expected, right? How could you change the code in the question to remove the pdf part? – Martin Thoma Jun 30 '22 at 22:00
  • Sorry, I didn't explain it correctly; I have edited it a little. I do get the page numbers, but I want 73, not 131. With `re.search("^statement")` I thought I would only get matches for sentences starting with "statement", but that doesn't work either, or maybe I am using the regex wrong. – Shashi Jul 01 '22 at 05:11
  • Is P on page 131? – Martin Thoma Jul 02 '22 at 09:35
  • Thanks KJ, I will try that. Also, Martin, the P that I want is on page 73, but as P is also part of S, the result shows as if it's on the S page as well. But thanks, I think I have got the hang of it from KJ and will try that. – Shashi Jul 04 '22 at 06:25
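
On the `re.search("^statement")` attempt mentioned in the comments, a likely reason it finds nothing: without flags, `^` only matches at the very start of the whole string, and the match is case-sensitive. The sketch below shows the difference with `re.MULTILINE` and `re.IGNORECASE`. The sample strings are hypothetical, and it assumes the heading starts its own line in the extracted text, which depends on how the PDF was produced.

```python
import re

# Hypothetical extracted page text; real extract_text() output may differ.
page_text = "Annual Report\nSTATEMENT OF PROFIT AND LOSS\nfor the year"
consolidated_text = (
    "Annual Report\nCONSOLIDATED STATEMENT OF PROFIT AND LOSS\nfor the year"
)

# No flags: "^" anchors to the start of the string only, and lowercase
# "statement" does not match the uppercase heading.
print(bool(re.search(r"^statement", page_text)))

# MULTILINE lets "^" match at the start of every line;
# IGNORECASE makes the comparison case-insensitive.
line_start = re.compile(
    r"^[ \t]*statement of profit and loss", re.MULTILINE | re.IGNORECASE
)
print(bool(line_start.search(page_text)))
print(bool(line_start.search(consolidated_text)))
```

With these samples, only the second search finds a match, because the consolidated heading's line starts with "CONSOLIDATED" rather than "STATEMENT".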
