12

I have a collection of .pdf files with comments that were added in Adobe Acrobat. I would like to be able to analyze these comments, but I'm kind of stuck on extracting them. I've looked at the pdftools package, but it seems to only be able to extract the text and not the comments. Is there a method available for extracting the comments within R?

Robert Bradford

3 Answers

9

PyMuPDF (https://pymupdf.readthedocs.io/en/latest/) is the only Python library I have found that works.

Installation in Debian/Ubuntu-based distributions:

apt-get install python3-fitz

Script:

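import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
# Iterating the document yields its pages; this avoids the deprecated doc.pageCount
for page in doc:
    # page.annots() yields every annotation (comments, highlights, ...) on the page
    for annot in page.annots():
        print(annot.info["content"])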
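
Since the goal is to analyze the comments in R, the loop above could be extended to write them to a CSV file. The following is only a sketch: the function names `annotation_rows` and `write_comments_csv`, the column names, and the file names are made up for illustration, and it assumes PyMuPDF's `annot.info` dictionary (where the author is stored under the `"title"` key) and `annot.type` tuple.

```python
import csv


def annotation_rows(doc):
    """Collect one dict per annotation from a PyMuPDF-style document."""
    rows = []
    for page in doc:
        for annot in page.annots():
            info = annot.info
            rows.append({
                "page": page.number + 1,            # page.number is 0-based
                "type": annot.type[1],              # e.g. "Text" or "Highlight"
                "author": info.get("title", ""),    # author is stored under "title"
                "content": info.get("content", ""),
            })
    return rows


def write_comments_csv(pdf_path, csv_path):
    """Hypothetical helper: dump all annotations of one PDF into a CSV file."""
    import fitz  # PyMuPDF; imported here so annotation_rows() has no hard dependency

    doc = fitz.open(pdf_path)
    rows = annotation_rows(doc)
    doc.close()
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["page", "type", "author", "content"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

After calling `write_comments_csv("example.pdf", "comments.csv")`, the file can be loaded in R with `read.csv("comments.csv")`, which returns a data.frame.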
Martin Monperrus
Bernuly
  • BTW, people may find it useful to know that to install fitz in a conda environment, you should activate the environment and then run `pip install pymupdf`, which provides the `fitz` module. Avoid `pip install fitz`: that installs a different package (see https://github.com/kastman/fitz/blob/master/doc/source/installing.rst) and leads to errors like this one: https://github.com/pymupdf/PyMuPDF/issues/523#issuecomment-830746585 – Homero Esmeraldo May 20 '22 at 03:23
  • Is there a way to make fitz extract the highlighted content as well? I created a related question: https://stackoverflow.com/questions/72311956/how-to-extract-highlights-and-text-box-contents-from-pdf-in-python – Homero Esmeraldo May 20 '22 at 03:27
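
On the highlight question in the comment above, one possible approach is to read the page text that falls inside each highlight's bounding box. This is a rough sketch, not a definitive method: `highlighted_text` is a made-up name, and it assumes PyMuPDF's `page.get_text("text", clip=...)` and `annot.rect`. Because it uses the whole bounding rectangle, a highlight that spans several lines may pick up neighbouring words.

```python
def highlighted_text(doc):
    """Return (page_number, text) pairs, one per highlight annotation."""
    results = []
    for page in doc:
        for annot in page.annots():
            if annot.type[1] != "Highlight":
                continue  # skip sticky notes, text boxes, and other types
            # Approximation: take all page text inside the annotation's bounding box
            text = page.get_text("text", clip=annot.rect).strip()
            results.append((page.number + 1, text))  # page.number is 0-based
    return results


# Usage (requires PyMuPDF):
#   import fitz
#   print(highlighted_text(fitz.open("example.pdf")))
```

Text box (FreeText) annotations should not need this step, assuming their text is available through `annot.info["content"]` as in the script above.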
0

Did you try PoDoFo or another open-source tool that can access the PDF elements? If you are willing to do a little programming, you can also look at "Extracting PDF annotations/comments" here on Stack Overflow.

PDFix
  • I've tried a few tools, but they all seem focused on extracting images and text. The Python method you linked to, combined with the reticulate package, looks promising, and I'd actually played around with that a bit last week, but the poppler module doesn't seem to want to install. I guess there isn't a native solution in R. – Robert Bradford Jun 14 '18 at 15:06
  • I got it. Sometimes it's hard to find a working solution for such specific cases. Have you tried looking for a paid solution that would work? Some of them offer a free trial. Which programming language and platform do you prefer? – PDFix Jun 15 '18 at 13:49
  • My preference would be a method that imported the comments into R on Windows as a data.frame. I was finally able to get poppler working using the Linux subsystem on Windows which is less than optimal, but better than nothing. – Robert Bradford Jun 18 '18 at 19:11
0

Could you export the comments as an Excel file and then import that file into R?

E.g., in PDF-XChange Editor, go to Comment > Summarize Comments and export into whatever format you want. Adobe Acrobat has a similar option.

ecm
  • 2
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-ask). – Community Sep 14 '21 at 05:34