My current work involves dealing with a lot of PDF files (datasheets). I need to detect the differences between the old and new versions of a datasheet. Is there a version control method that can automatically display the differences between PDF files? Free methods are very welcome. Is there a Python package that can help with this?
-
The PDF files normally contain figures, tables, text, and timing diagrams. – Janet Zheng Oct 11 '17 at 10:20
-
Do you need to extract differences, or do you need version control? Most version control systems are diff-based, but that usually doesn't work well for the PDF format, which may contain binary data. If you only need version control, that's easy: almost every system should be able to work with PDF files. If you need a diff between PDF files, you'll probably have to extract the data from the PDFs first. – Johannes Müller Oct 11 '17 at 10:30
-
I need diff between PDF files. – Janet Zheng Oct 11 '17 at 10:43
-
For extracting the data from the PDF, I am thinking about using pdfminer, but I haven't tried it yet. Since my PDFs are normally hundreds of pages long, I don't know how efficient it would be... – Janet Zheng Oct 11 '17 at 10:46
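As a rough illustration of the pdfminer idea, here is a minimal sketch that extracts text page by page and reports which page numbers differ. It assumes the pdfminer.six fork is installed; `page_texts` and `changed_pages` are hypothetical helper names, not part of any library. The comparison is deliberately naive: it pairs pages by position, so an inserted page will make every later page show up as changed.

```python
from itertools import zip_longest


def page_texts(pdf_path):
    """Return a list with the extracted text of each page (needs pdfminer.six)."""
    # Imported here so the comparison helper below works without pdfminer.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    pages = []
    for page_layout in extract_pages(pdf_path):
        # Keep only layout elements that carry text; figures are skipped.
        chunks = [
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        ]
        pages.append("".join(chunks))
    return pages


def changed_pages(old_pages, new_pages):
    """Return 1-based page numbers whose text differs (hypothetical helper)."""
    changed = []
    pairs = zip_longest(old_pages, new_pages, fillvalue=None)
    for number, (old, new) in enumerate(pairs, start=1):
        # A page missing from one version (None) also counts as changed.
        if old != new:
            changed.append(number)
    return changed
```

Usage would look like `changed_pages(page_texts("old.pdf"), page_texts("new.pdf"))`, where both file names are placeholders. Only text is compared, so changes inside figures or timing diagrams would not be detected this way.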
-
For a visual diff between PDF files, see [How can I unittest whether PDF files have been generated correctly?](https://stackoverflow.com/questions/38482918/how-can-i-unittest-whether-pdf-files-have-been-generated-correctly/38552066#38552066) – Brecht Machiels Oct 11 '17 at 10:48
-
Is it possible to have the PDF differences displayed the way BeyondCompare does, for the whole document rather than just a single page? – Janet Zheng Oct 11 '17 at 11:08
-
@JanetZheng You can try to extract the text with a tool such as `pdftotext` (included with poppler) and feed that to BeyondCompare. But I expect this will not be a satisfactory solution either. I'm afraid this is simply not something PDF is suited for. – Brecht Machiels Nov 03 '17 at 11:42
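
The `pdftotext` suggestion can also be scripted from Python instead of going through BeyondCompare. The following is a minimal sketch, assuming poppler's `pdftotext` is on the PATH and Python 3.7 or later; `pdf_to_text` and `text_diff` are hypothetical helper names, and the standard-library `difflib` stands in for BeyondCompare. As noted above, this only covers text, so figures and timing diagrams are ignored.

```python
import difflib
import subprocess


def pdf_to_text(pdf_path):
    """Run poppler's pdftotext and return the extracted text as a string."""
    # "-layout" tries to preserve the physical layout (helps with tables);
    # the trailing "-" makes pdftotext write to stdout instead of a file.
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def text_diff(old_text, new_text, old_name="old.pdf", new_name="new.pdf"):
    """Return a unified diff of two extracted texts as a list of lines."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )
```

Printing `"\n".join(text_diff(pdf_to_text("old.pdf"), pdf_to_text("new.pdf")))` would give a whole-document diff; the file names are placeholders. For a side-by-side view closer to what BeyondCompare shows, `difflib.HtmlDiff` can render the same two line lists as an HTML table.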