Splitting a PDF in two every time a keyword appears with Python

Question

My goal is to split a PDF every time a keyword appears in it. I have a document containing many different experiments; sometimes there are two on one page, sometimes one experiment runs over 10 pages. Every time "new experiment" appears on a page, I would like that page and every following page, until "new experiment" appears again, to go into its own PDF. I tried to write some code, but it fails in several places:

    import os
    import re

    import PyPDF2  # legacy PyPDF2 (< 3.0) names such as PdfFileReader

    path = "path.pdf"
    file = open(path, "rb")  # must stay open until the new PDFs are written
    pdfReader = PyPDF2.PdfFileReader(file)
    number_of_pages = pdfReader.numPages
    print(number_of_pages)
    start_pages = []  # 0-based numbers of the pages that contain the keyword

    keyword = "experiment"
    for i in range(number_of_pages):
        content = pdfReader.getPage(i).extractText() + "\n"
        # drop non-ASCII characters and lowercase the text before searching
        content = content.encode("ascii", "ignore").decode("ascii").lower()
        res_search = re.search(keyword, content)
        print(res_search)
        if res_search is not None:
            # no break here, otherwise only the first matching page is kept
            start_pages.append(i)

    # partially code from: https://stackoverflow.com/questions/12571905/finding-on-which-page-a-search-string-is-located-in-a-pdf-document-using-python

    # problem here: for me re.search always returns None, so no elements get added to the list

    fname = os.path.splitext(os.path.basename(path))[0]
    # each new PDF runs from one keyword page up to (not including) the next one
    end_pages = start_pages[1:] + [number_of_pages]
    for start, end in zip(start_pages, end_pages):
        pdf_writer = PyPDF2.PdfFileWriter()
        for page in range(start, end):
            pdf_writer.addPage(pdfReader.getPage(page))
        output_filename = '{}_page_{}.pdf'.format(fname, start + 1)

        with open(output_filename, 'wb') as out:
            pdf_writer.write(out)

        print('Created: {}'.format(output_filename))

    # this is where my problem starts: each new PDF should get all pages until the keyword appears again
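The splitting rule itself (start a new file on every page that contains the keyword, and keep adding pages until the next such page) can be sketched separately from the PDF handling. The following is only a minimal sketch under some assumptions: it uses the newer `pypdf` package instead of the legacy PyPDF2 names, it assumes `extract_text()` returns usable text for the document, and the helper names `find_start_pages`, `page_ranges` and `split_pdf` are made up for illustration.

```python
import os
import re


def find_start_pages(page_texts, keyword):
    """Return the 0-based indices of pages whose text contains the keyword."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    starts = []
    for index, text in enumerate(page_texts):
        # Collapse line breaks and repeated spaces so that a keyword
        # wrapped across lines is still found.
        normalized = " ".join((text or "").split())
        if pattern.search(normalized):
            starts.append(index)
    return starts


def page_ranges(starts, number_of_pages):
    """Pair each start page with the next start page (or the end of the file)."""
    ends = starts[1:] + [number_of_pages]
    return list(zip(starts, ends))


def split_pdf(path, keyword="new experiment"):
    """Write one PDF per keyword page and return the created file names."""
    # Assumption: the pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)
    texts = [page.extract_text() or "" for page in reader.pages]
    starts = find_start_pages(texts, keyword)
    base = os.path.splitext(os.path.basename(path))[0]

    created = []
    ranges = page_ranges(starts, len(reader.pages))
    for number, (first, last) in enumerate(ranges, start=1):
        writer = PdfWriter()
        for index in range(first, last):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <input name>_experiment_<n>.pdf
        output_filename = "{}_experiment_{}.pdf".format(base, number)
        with open(output_filename, "wb") as out:
            writer.write(out)
        created.append(output_filename)
    return created
```

Calling `split_pdf("path.pdf")` would then write one file per page on which the keyword was found. Because the split is page-based, a page that holds two experiments goes entirely into the file that starts on that page, and any pages before the first keyword are skipped.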

If anybody has a suggestion for my code, or knows a better approach, I would appreciate your help.

Thanks a lot
