I downloaded a series of PDF files and I want to join them. I am aware of PyPDF and similar modules, but I want to know why I cannot use the file.write() method to join PDF files.

Here is the code I used to download the PDF files.

import requests

for i in range(3):
    url = 'http://ncert.nic.in/ncerts/l/leph10{}.pdf'.format(i+1)
    # stream=True lets iter_content() fetch the body in chunks
    response = requests.get(url, stream=True)
    with open('file{}.pdf'.format(i+1), 'wb') as file:
        for chunk in response.iter_content(chunk_size=1024):
            file.write(chunk)

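Then I used the following code to join them.

# 'ab' appends raw bytes to combined.pdf (and keeps whatever is already there)
with open('combined.pdf', 'ab') as combined:
    # range(2, -1, -1) visits file3.pdf, file2.pdf, file1.pdf in that order
    for i in range(2, -1, -1):
        with open('file{}.pdf'.format(i+1), 'rb') as file:
            for chunk in file:
                combined.write(chunk)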

The combined file shows only the first file, not the remaining two. However, the size of the combined file is the sum of the sizes of all three files.

I searched through many blogs/questions here to find answers, but everyone seems to suggest PyPDF or similar modules for dealing with PDFs in Python.

My questions are:

i) Why does the combined file show only the first file, even though its actual size is much bigger? I am not getting any exceptions or errors.

ii) Why can I not join PDF files using such a simple write() method in Python?

SourabhJain

1 Answer

Basically, because PDF files are very complicated things. Each PDF has a header, data, and an end section. So, if you glue a few of them together, the reader you use to look at them will find the first PDF's end section and just finish reading, ignoring any information that follows.
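
One way to check that the glued file really contains several complete documents is to count the structural markers in it. The sketch below is only an illustration: count_pdf_markers and fake_pdf are hypothetical helpers, and the byte strings are toy stand-ins rather than valid PDFs.

```python
def count_pdf_markers(data: bytes) -> dict:
    """Count PDF header and end-of-file markers in raw bytes."""
    return {
        "headers": data.count(b"%PDF-"),
        "eof_markers": data.count(b"%%EOF"),
    }


def fake_pdf(label: bytes) -> bytes:
    """Build a toy stand-in with a header and an end marker (not a valid PDF)."""
    return b"%PDF-1.4\n" + label + b"\n%%EOF\n"


parts = [fake_pdf(b"file1"), fake_pdf(b"file2"), fake_pdf(b"file3")]
combined = b"".join(parts)  # the same thing repeated write() calls produce

# All bytes are still there, which is why the size is the sum of the parts.
print(len(combined) == sum(len(p) for p in parts))  # True
print(count_pdf_markers(combined))  # {'headers': 3, 'eof_markers': 3}
```

Running count_pdf_markers on the bytes of a real combined.pdf should report one header and one end marker per glued file, assuming the marker strings do not also occur inside the file data.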

There are (at least in Unix/Linux) several tools which will permit you to combine PDFs. One example is pdfjoin, of which the manual page says:

pdfjoin concatenates the pages of multiple Portable Document Format (PDF) files together into a single file.

(pdfjoin is part of the "PDFjam" package of tools.)

Note that even such programs may run into problems, since each PDF may store its data in a way that conflicts with the others.

EDIT: PDF documents are also fairly difficult to decode. Just to make a point, here's a very minimal PDF: just the word John on an empty page. The original text file has 6 characters; converted to PostScript, it occupies 13000+ characters. Converting that to PDF (with ps2pdf) reduces it to 3800+ bytes.

This is part of the header section of the PDF:

%PDF-1.4
%.쏢
5 0 obj
<</Length 6 0 R/Filter /FlateDecode>>
stream
-- edited out ---
endstream
endobj
6 0 obj
97
endobj
4 0 obj
<</Type/Page/MediaBox [0 0 612 792]
/Rotate 0/Parent 3 0 R
/Resources<</ProcSet[/PDF /Text]
/ExtGState 10 0 R
/Font 11 0 R
...

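Note that in line 4 the length of the stream is not written directly: /Length 6 0 R refers to object 6, which holds the value 97. Lengths and object numbers like these would have to be recomputed in the output file. And this is part of the end section: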

<</Producer(GPL Ghostscript 9.20)
/CreationDate(D:20171004210838-03'00')
/ModDate(D:20171004210838-03'00')
/Title(john.txt)
/Author()
/Creator(a2ps version 4.14)>>endobj
xref
0 14
0000000000 65535 f
0000000419 00000 n
0000003214 00000 n
0000000360 00000 n
0000000200 00000 n
0000000015 00000 n
0000000182 00000 n
0000000484 00000 n
0000000585 00000 n
0000000820 00000 n
0000000525 00000 n
0000000555 00000 n
0000001081 00000 n
0000001733 00000 n
trailer
<< /Size 14 /Root 1 0 R /Info 2 0 R
/ID [<EF5D1976DF3773944878D6157BCEE651><EF5D1976DF3773944878D6157BCEE651>]
>>
startxref
3392
%%EOF
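
In the PDF format, the numbers in the xref table and the value after startxref are byte offsets counted from the beginning of the file, so they are only valid for the file they were written for. The sketch below illustrates this with toy data; toy_pdf, last_startxref and points_at_xref are hypothetical helpers, and the toy files are far simpler than real PDFs.

```python
import re


def last_startxref(data: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", data)
    if not matches:
        raise ValueError("no startxref found")
    return int(matches[-1])


def points_at_xref(data: bytes) -> bool:
    """Check whether the recorded offset really lands on the 'xref' keyword."""
    offset = last_startxref(data)
    return data[offset:offset + 4] == b"xref"


def toy_pdf(body: bytes) -> bytes:
    """Build a toy file whose startxref offset is correct for that file alone."""
    head = b"%PDF-1.4\n" + body + b"\n"
    xref_offset = len(head)  # the xref section starts right after the body
    tail = b"xref\n0 1\ntrailer\n<<>>\nstartxref\n"
    return head + tail + str(xref_offset).encode() + b"\n%%EOF\n"


first = toy_pdf(b"first document")
second = toy_pdf(b"second document, with a longer body")

print(points_at_xref(first))           # True
print(points_at_xref(second))          # True
print(points_at_xref(first + second))  # False: the offset is now stale
```

After concatenation the last startxref value still counts from the start of the second file, so in the glued file it points into the middle of the first one. What a viewer then displays may depend on how it recovers from that inconsistency.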

The original text isn't even in readable form in the PDF: the stream shown in the header excerpt is compressed (/Filter /FlateDecode), so its contents have to be decompressed before they can be read.
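
FlateDecode is the zlib/deflate compression that Python's zlib module implements, so the effect is easy to reproduce. In this minimal sketch the content string is an assumed stand-in for a page content stream; the stream that ps2pdf actually wrote will look different.

```python
import zlib

# Assumed stand-in for a page content stream that draws the word "John".
content = b"BT /F1 12 Tf 72 720 Td (John) Tj ET"

# A FlateDecode stream stores the zlib-compressed form of the data.
compressed = zlib.compress(content)
print(compressed)  # binary data, not the text above

# Decompressing recovers the original bytes exactly.
restored = zlib.decompress(compressed)
print(restored == content)  # True
```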

I'm not saying it's impossible, but I would suggest you at least use a library of some kind to disassemble the original PDFs and to re-encode them for the output. Have a look at 'Manipulating PDFs with Python' or PDFMiner.
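
As one possible illustration of the library route, here is a minimal sketch using the third-party pypdf package (an assumption on my part; it is a successor of the PyPDF modules mentioned in the question and is installed with pip install pypdf). merge_pdfs is a hypothetical helper name, and the file names are the ones from the question.

```python
def merge_pdfs(input_paths, output_path):
    """Merge several PDFs into one file using pypdf (third-party package)."""
    # Imported inside the function so this module loads without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for path in input_paths:
        writer.append(path)  # adds every page of this file to the output
    with open(output_path, "wb") as out:
        writer.write(out)    # writes one consistent document structure
    writer.close()


# Example call, assuming the three downloaded files exist:
# merge_pdfs(["file1.pdf", "file2.pdf", "file3.pdf"], "combined.pdf")
```

Unlike raw write() calls, the library produces a single document with one cross-reference table and one trailer covering all the pages.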

jcoppens
  • "reader you use to look at them will find the first PDF's end section and just finish reading, ignoring any information that follow." I do not understand. Could you please elaborate on it? When I use 'a' mode, does that not put the pointer at the end of the file and start appending from that point? Also, why is the size of the combined PDF equal to the sum of all the joined files, when the data shown in the combined file is only from the first file? – SourabhJain Oct 04 '17 at 19:36
  • Added some comments to my answer. And, as I said in the answer, the output file *does* contain all the PDFs you glued together, but the reader only shows the first one, which explains the size. – jcoppens Oct 05 '17 at 00:27
  • Thanks. Your comments do help to understand why it is difficult to write to a pdf. I found this link also helpful. https://stackoverflow.com/questions/45953770/creating-and-writing-to-a-pdf-file-in-python?rq=1 – SourabhJain Oct 05 '17 at 04:59