I have a PDF file from which I removed some pages, and I want to fix the page numbers in the new PDF. Is there a way or a library to update the page numbers without converting the PDF to another format? I have tried converting the PDF to text, XML, and JSON and then fixing the page numbers, but when I convert it back to PDF the result looks messy (it does not keep the style of the original PDF). The problems I have are:

  1. Removing the old page numbers.
  2. Adding new page numbers.

I am using Python on Ubuntu. I have tried ReportLab, PyX, and pyfpdf.

Sina
  • https://stackoverflow.com/a/2180841/7994074 – ParthS007 Jun 25 '19 at 18:55
  • Thanks. I have seen this post and tried it, but it does not work for me. The problems I have are: (1) removing the old page numbers and (2) adding the new page numbers. – Sina Jun 25 '19 at 20:31
  • PyPDF2 might help. – Legorooj Jul 02 '19 at 12:54
  • @Legorooj thanks. Actually, I am using PyPDF2 in my project to read the original PDF pages and remove unwanted pages from the original one. The output of PyPDF2 is pure text and it is not convertible to the original format. Even if I use PyPDF2 to find page numbers, sometimes it updates the wrong string(page number). Thanks again. – Sina Jul 02 '19 at 16:59
  • Hmm. Will look into this. – Legorooj Jul 02 '19 at 17:03
  • https://stackoverflow.com/questions/31291282/how-to-add-page-number-to-a-pdf-file might be helpful. – Legorooj Jul 02 '19 at 17:09
  • https://www.geeksforgeeks.org/working-with-pdf-files-in-python/ the last tutorial could also help. – Legorooj Jul 02 '19 at 17:13
  • Thanks, I will check them. – Sina Jul 02 '19 at 18:57
  • PIL has worked well with me for writing PDFs, if that helps at all! – ladygremlin Jul 03 '19 at 00:56
  • @ladygremlin Thanks. Can you briefly explain the step you did? – Sina Jul 03 '19 at 16:18
  • @john I'm sorry, I didn't do page removal and addition, so I can't exactly help here. What I'd recommend is rebuilding the PDF with PIL and just removing the pages you don't want. That'd look like reading in a PDF, identifying pages you don't want, and then rebuilding a PDF without those pages and returning the new one. That'd keep page #s consistent, if nothing else. – ladygremlin Jul 03 '19 at 16:23
  • @ladygremlin Thanks a lot. – Sina Jul 03 '19 at 16:31
  • @john https://www.binpress.com/manipulate-pdf-python/ and https://stackoverflow.com/questions/1180115/add-text-to-existing-pdf-using-python might be helpful – Kumar Mangalam Jul 04 '19 at 07:01
  • @Mangy007 Thanks. Will take a look. – Sina Jul 05 '19 at 16:26

1 Answer

I had a similar problem and, honestly, could not fully solve it; in the end I fetched the corresponding HTML and processed it with BeautifulSoup. However, I did get closer with an external tool than with Python modules: I used pdftotext.exe from poppler (link at the bottom) to read the PDF file, and it worked fine, except that it could not distinguish between text columns. Since this is not a Python module, I used os.system to run the command string on the .exe file.

import os

def call_poppler(input_pdf, input_path):

    """
    Call poppler to generate a txt file

    input_path is the path to the pdftotext executable
    """
    command_row = input_path + " " + input_pdf
    os.system(command_row)
    # With no output name given, pdftotext writes <name>.txt next to <name>.pdf
    txt_name = os.path.splitext(input_pdf)[0] + ".txt"
    processed_paper = open_txt(txt_name)
    return processed_paper

def open_txt(input_txt_name):

    """
    Open and generate a python object (a list of stripped lines)
    out of the txt attained with poppler
    """
    output_file = []
    # Read as bytes and decode explicitly; the file is closed on exit
    with open(input_txt_name, "rb") as opened_file:
        for row in opened_file:
            output_file.append(row.decode("utf-8").strip())
    return output_file

This returns the contents of the processed ".txt" file, which you can then process as you want and rewrite as a PDF with some module, such as pypdf. Sorry if this is not the answer you wanted, but PDF files are rather hard to handle in Python since they are not text-based files. Do not forget to give the path of the executable. You can get poppler here: https://poppler.freedesktop.org/
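As a rough illustration of the "process as you want" step, here is a minimal sketch of renumbering pages in the extracted text. It rests on two assumptions that may not hold for your file: that the page number is the last non-empty line of each page and contains only digits, and that you work on the raw text rather than on stripped lines, because pdftotext separates pages with a form feed character by default and stripping whitespace removes it. The name renumber_pages and its first_number parameter are made up for this example.

```python
import re

PAGE_BREAK = "\f"  # pdftotext puts a form feed between pages by default


def renumber_pages(text, first_number=1):
    """Replace a trailing bare page number on each page with a new sequence.

    Assumption: the page number is the last non-empty line of a page and
    consists of digits only. Pages that do not match are left untouched,
    but they still count towards the sequence.
    """
    chunks = text.split(PAGE_BREAK)
    # A trailing form feed leaves an empty final chunk; it is not a page.
    tail = ""
    if len(chunks) > 1 and not chunks[-1].strip():
        tail = PAGE_BREAK + chunks.pop()

    renumbered = []
    for offset, page in enumerate(chunks):
        lines = page.split("\n")
        # Walk backwards to the last non-empty line of this page.
        for i in range(len(lines) - 1, -1, -1):
            if lines[i].strip():
                if re.fullmatch(r"\s*\d+\s*", lines[i]):
                    new_number = str(first_number + offset)
                    lines[i] = re.sub(r"\d+", new_number, lines[i], count=1)
                break
        renumbered.append("\n".join(lines))

    return PAGE_BREAK.join(renumbered) + tail


if __name__ == "__main__":
    sample = "Intro\n\n3\n\fBody text\n5\n\f"
    print(repr(renumber_pages(sample)))
```

This only fixes the numbers in the text version; it does nothing to preserve the layout of the original PDF, and a page whose last line happens to be some other bare number would be changed by mistake.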

Preto
  • Thanks. How can I keep the format? PyPDF2 is doing the same thing. – Sina Jul 02 '19 at 23:54
  • With the approach I propose you do not keep the format: you transfer to another format and then rewrite to PDF. Sorry, that is the best I was able to come up with when considering the problem. The only difference from the modules is that poppler seems to perform better. The problem is in the reading of the PDF, because it is not a text-type file – Preto Jul 02 '19 at 23:59
  • Thanks. Do you know any way that I can add an image to pdf without changing the format? – Sina Jul 03 '19 at 16:13
  • Only know of python modules such as those you named, they are usually rather poor at reading the pdf's text... – Preto Jul 03 '19 at 16:57
  • Thanks again for the help. – Sina Jul 03 '19 at 17:01