My goal is to split a PDF every time a keyword appears in it. I have a document containing many different experiments; sometimes there are two on one page, sometimes one experiment runs over 10 pages. Every time "new experiment" appears on a page, I would like that page and every following page, until "new experiment" appears again, to be saved as its own PDF. I tried to write some code, but it fails on several lines:
import os
import re

import PyPDF2

path = "path.pdf"  # placeholder: path to the PDF that should be split
Keyword = "experiment"

# PdfFileReader, numPages, getPage and extractText are the PyPDF2 1.x names
file = open(path, "rb")
pdfReader = PyPDF2.PdfFileReader(file)
number_of_pages = pdfReader.numPages
print(number_of_pages)

pages_found = []  # 0-based indices of the pages that contain the keyword
for i in range(number_of_pages):
    content = pdfReader.getPage(i).extractText() + "\n"
    # Drop non-ASCII characters and lower-case the text before searching
    content1 = content.encode("ascii", "ignore").lower()
    content2 = content1.decode(encoding="utf-8")
    ResSearch = re.search(Keyword, content2)
    print(ResSearch)
    if ResSearch is not None:
        # No break here: every matching page is needed, not only the first one
        pages_found.append(i)
# partially code from: https://stackoverflow.com/questions/12571905/finding-on-which-page-a-search-string-is-located-in-a-pdf-document-using-python
# Problem here: for my file re.search always returns None, so no elements are added to the list

fname = os.path.splitext(os.path.basename(path))[0]
for n, start in enumerate(pages_found):
    # A section ends just before the next keyword page, or at the end of the document
    end = pages_found[n + 1] if n + 1 < len(pages_found) else number_of_pages
    pdf_writer = PyPDF2.PdfFileWriter()
    for page in range(start, end):
        pdf_writer.addPage(pdfReader.getPage(page))
    output_filename = "{}_page_{}.pdf".format(fname, start + 1)
    with open(output_filename, "wb") as out:
        pdf_writer.write(out)
    print("Created: {}".format(output_filename))
file.close()
# The difficult part here: each new PDF should receive all pages until the keyword appears again
If anybody has a suggestion for my code, or perhaps a better approach, I would appreciate your help.
Thanks a lot
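
One possible way to approach this, as a minimal sketch rather than a definitive solution: collect the pages on which the keyword appears, pair each of them with the next such page, and write every range to its own file. The sketch assumes PyPDF2 2.x or later, where the classes are named `PdfReader` and `PdfWriter`. The helper names `find_start_pages`, `page_ranges` and `split_pdf` are made up for illustration. It also assumes that a section ends just before the next keyword page, and that extracted text may contain irregular whitespace, so whitespace between the words of the keyword is treated as optional.

```python
import os
import re


def find_start_pages(page_texts, keyword):
    """Return 0-based indices of the pages whose text contains the keyword.

    Hypothetical helper: matching ignores case and tolerates any (or no)
    whitespace between the words of the keyword.
    """
    words = [re.escape(word) for word in keyword.split()]
    pattern = re.compile(r"\s*".join(words), re.IGNORECASE)
    return [i for i, text in enumerate(page_texts) if pattern.search(text)]


def page_ranges(start_pages, number_of_pages):
    """Pair each start page with the page where its section ends (exclusive)."""
    ends = start_pages[1:] + [number_of_pages]
    return list(zip(start_pages, ends))


def split_pdf(path, keyword="new experiment"):
    """Write one PDF per section; pages before the first keyword are skipped."""
    # Imported here so the two helpers above work without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(path)
    texts = [page.extract_text() or "" for page in reader.pages]
    starts = find_start_pages(texts, keyword)
    fname = os.path.splitext(os.path.basename(path))[0]

    for start, end in page_ranges(starts, len(reader.pages)):
        writer = PdfWriter()
        for i in range(start, end):
            writer.add_page(reader.pages[i])
        output_filename = "{}_page_{}.pdf".format(fname, start + 1)
        with open(output_filename, "wb") as out:
            writer.write(out)
        print("Created: {}".format(output_filename))
```

For example, `page_ranges([0, 3, 4], 10)` gives `[(0, 3), (3, 4), (4, 10)]`, so the third file would contain pages 5 to 10 of the document. If two experiments start on the same page, a page-based split like this cannot separate them; they end up in the same output file.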