8

[Image: screenshot of the annotated PDF, showing a struck-through phrase with its pop-up suggestion]

I have a PDF file that contains annotations, each with a suggestion that appears when hovering over the annotated word.

For example, in the image above the phrase "you'll spend" is struck through (meaning it is incorrect), and hovering over it shows a pop-up window containing the correct wording. Similarly, there is a caret sign that behaves the same way.

I want to extract a list of both words from such files: the incorrect word and its correction.

yivi
vivek salve
  • We have a demo for a commercial product (written in PHP) which does the same thing for highlight annotations. It shouldn't be hard to adjust this to other annotation types. But I am just wondering what data you expect for the caret? – Jan Slabon Jul 30 '19 at 15:26
  • The caret is similar to the other annotations: on mouse over of the caret, a popup appears showing some text, as shown in the image above. I am also interested in a commercial product. – vivek salve Jul 31 '19 at 11:56
  • See this link https://stackoverflow.com/questions/1106098/parse-annotations-from-a-pdf. It uses Python, but might point you in the right direction. If you can extract the data, you might be able to parse the information and filter out what you need. – user11809641 Aug 08 '19 at 05:34
  • You need to make clear what language you want to accomplish this with. This question is tagged both as PHP and JS. – yivi Aug 09 '19 at 12:08

3 Answers

0

I just did a simple POC with our SetaPDF-Extractor component (a commercial product of ours), which produces this: [image: result of the POC]

Sadly, the comments "tree" in a PDF is not that trivial. The POC just iterates through the annotations and creates filters, which are then used by the extractor component. Here is another demo that extracts the comments tree, which may be the basis for a sorted/more logical result.

Here's the code I used for the given output:

<?php
// load and register the autoload function
require_once('library/SetaPDF/Autoload.php');

// create a document instance
$document = SetaPDF_Core_Document::loadByFilename('camtown/Terms-and-Conditions - revised.pdf');
// initiate an extractor instance
$extractor = new SetaPDF_Extractor($document);

// get the document's pages object
$pages = $document->getCatalog()->getPages();

// we are going to save the extracted text in this variable
$results = [];
// map pages and filternames to annotation instances
$annotationsByPageAndFilterName = [];

// iterate over all pages
for ($pageNo = 1, $pageCount = $pages->count(); $pageNo <= $pageCount; $pageNo++) {
    // get the page object
    $page = $pages->getPage($pageNo);
    // get the annotations
    $annotations = array_filter($page->getAnnotations()->getAll(), function(SetaPDF_Core_Document_Page_Annotation $annotation) {
        switch ($annotation->getType()) {
            case SetaPDF_Core_Document_Page_Annotation::TYPE_HIGHLIGHT:
            case SetaPDF_Core_Document_Page_Annotation::TYPE_STRIKE_OUT:
            case SetaPDF_Core_Document_Page_Annotation::TYPE_CARET:
            case SetaPDF_Core_Document_Page_Annotation::TYPE_UNDERLINE:
                return true;
        }

        return false;
    });

    // create a strategy instance
    $strategy = new SetaPDF_Extractor_Strategy_ExactPlain();
    // create a multi filter instance
    $filter = new SetaPDF_Extractor_Filter_Multi();
    // and pass it to the strategy
    $strategy->setFilter($filter);

    // iterate over the filtered annotations (highlight, strike-out, caret, underline)
    foreach ($annotations AS $tmpId => $annotation) {
        /**
         * @var SetaPDF_Core_Document_Page_Annotation_Highlight $annotation
         */
        $name = 'P#' . $pageNo . '/HA#' . $tmpId;
        if ($annotation->getName()) {
            $name .= ' (' . $annotation->getName() . ')';
        }

        if ($annotation instanceof SetaPDF_Core_Document_Page_Annotation_TextMarkup) {
            // iterate over the quad points to setup our filter instances
            $quadpoints = $annotation->getQuadPoints();
            for ($pos = 0, $c = count($quadpoints); $pos < $c; $pos += 8) {
                $llx = min($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]) - 1;
                $urx = max($quadpoints[$pos + 0], $quadpoints[$pos + 2], $quadpoints[$pos + 4], $quadpoints[$pos + 6]) + 1;
                $lly = min($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]) - 1;
                $ury = max($quadpoints[$pos + 1], $quadpoints[$pos + 3], $quadpoints[$pos + 5], $quadpoints[$pos + 7]) + 1;

                // reduce it to a small line
                $diff = ($ury - $lly) / 2;
                $lly = $lly + $diff - 1;
                $ury = $ury - $diff - 1;

                // Add a new rectangle filter to the multi filter instance
                $filter->addFilter(
                    new SetaPDF_Extractor_Filter_Rectangle(
                        new SetaPDF_Core_Geometry_Rectangle($llx, $lly, $urx, $ury),
                        SetaPDF_Extractor_Filter_Rectangle::MODE_CONTACT,
                        $name
                    )
                );
            }
        }

        $annotationsByPageAndFilterName[$pageNo][$name] = $annotation;
    }

    // if no filters for this page defined, ignore it
    if (count($filter->getFilters()) === 0) {
        continue;
    }

    // pass the strategy to the extractor instance
    $extractor->setStrategy($strategy);
    // and get the results by the current page number
    $result = $extractor->getResultByPageNumber($pageNo);
    if ($result === '') {
        continue;
    }

    $results[$pageNo] = $result;
}

// debug output
foreach ($annotationsByPageAndFilterName AS $pageNo => $annotations) {
    echo '<h1>Page No #' . $pageNo . '</h1>';
    echo '<table border="1"><tr><th>Name</th><th>Text</th><th>Subject</th><th>Comment</th></tr>';
    foreach ($annotations AS $name => $annotation) {
        echo '<tr>';
        echo '<td>' . $name . '</td>';
        echo '<td><pre>' . ($results[$pageNo][$name] ?? '') . '</pre></td>';
        echo '<td><pre>' . $annotation->getSubject() . '</pre></td>';
        echo '<td><pre>' . $annotation->getContents() . '</pre></td>';
        echo '</tr>';
    }

    echo '</table>';
}
Jan Slabon
  • This plugin sounds good, but from what I see it is not extracting the caret. Is it possible to extract the caret? – vivek salve Aug 02 '19 at 12:49
  • Also, I need the extracted words in the proper sequence, from top to bottom. – vivek salve Aug 02 '19 at 12:53
  • Actually the Carets are in the table as "Inserted Text". As I already wrote, annotation handling in PDF is not trivial, and the Caret is one such special case. The "Cross-Out" is a reply to the "Inserted Text" annotation, and the two form a single annotation in standard PDF viewers. You will see this if you pass such a document to [this](https://www.setasign.com/products/setapdf-core/demos/extract-comments/#p-393) demo. In any case I would suggest requesting an evaluation version of the component [here](https://www.setasign.com/products/setapdf-extractor/evaluate/) so you can play with it on your own. – Jan Slabon Aug 02 '19 at 13:33
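
The reply relationship described in the comments above (a "Cross-Out" that is a reply to an "Inserted Text" caret) and the requested top-to-bottom order can be sketched in plain PHP. This is only an illustration and not part of SetaPDF: the record keys (`id`, `inReplyTo`, `page`, `y`, `text`, `contents`), the function name `groupCorrections` and the sample values are all assumptions, and it presumes the default PDF coordinate system, where a larger `y` is closer to the top of the page.

```php
<?php
// Hypothetical flat records, as one might collect them from any PDF library.
// 'text' is the page text under the annotation, 'contents' the pop-up text.
// The values below are made-up sample data.
$annotations = [
    ['id' => 'a1', 'inReplyTo' => null, 'type' => 'Caret',     'page' => 1, 'y' => 700.0, 'text' => '',             'contents' => 'you will pay'],
    ['id' => 'a2', 'inReplyTo' => 'a1', 'type' => 'StrikeOut', 'page' => 1, 'y' => 700.0, 'text' => "you'll spend", 'contents' => ''],
    ['id' => 'a3', 'inReplyTo' => null, 'type' => 'StrikeOut', 'page' => 1, 'y' => 750.0, 'text' => 'teh',          'contents' => 'the'],
];

function groupCorrections(array $annotations): array
{
    // Index the records by id and collect replies under their parent id.
    $byId = [];
    foreach ($annotations as $a) {
        $byId[$a['id']] = $a;
    }
    $replies = [];
    foreach ($annotations as $a) {
        if ($a['inReplyTo'] !== null && isset($byId[$a['inReplyTo']])) {
            $replies[$a['inReplyTo']][] = $a;
        }
    }

    $rows = [];
    foreach ($annotations as $a) {
        // Replies are merged into their parent, so skip them here.
        if ($a['inReplyTo'] !== null && isset($byId[$a['inReplyTo']])) {
            continue;
        }
        $incorrect = $a['text'];
        $correct = $a['contents'];
        // Fill whatever the parent lacks from its replies.
        foreach ($replies[$a['id']] ?? [] as $reply) {
            if ($incorrect === '' && $reply['text'] !== '') {
                $incorrect = $reply['text'];
            }
            if ($correct === '' && $reply['contents'] !== '') {
                $correct = $reply['contents'];
            }
        }
        $rows[] = ['page' => $a['page'], 'y' => $a['y'], 'incorrect' => $incorrect, 'correct' => $correct];
    }

    // Sort by page, then from top to bottom (descending y).
    usort($rows, function (array $l, array $r): int {
        return [$l['page'], -$l['y']] <=> [$r['page'], -$r['y']];
    });

    return $rows;
}

$rows = groupCorrections($annotations);
foreach ($rows as $row) {
    echo $row['incorrect'], ' => ', $row['correct'], PHP_EOL;
}
```

Merging replies into their parent keeps one row per visible comment, which matches how standard viewers present a replaced-text edit.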
0

Have you tried this parser?

Features

  • Load and parse objects and headers
  • Extract metadata (author, description, keywords, ...)
  • Extract text from ordered pages
  • Support for compressed pdf (and not)
  • Support of charset encoding (WinAnsi, MacRoman)
  • Handling of hexa and octal content encoding
  • PSR-0 compliant (autoloader)
  • Compatible with Composer
  • PSR-1 compliant (code styling)

https://pdfparser.org/demo

Ezequiel Fernandez
-1

You need to extract information about the markup annotations present on the page and the contents of their associated child pop-up annotations (what you referred to as the 'suggestion'). You can then use the location of each markup annotation to find the text shown at that location on the page. That gives you the two pieces of information you need.
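
As a rough sketch of that reconciliation step, the snippet below matches words to an annotation rectangle by testing whether each word's center falls inside it. It is plain PHP and not tied to any PDF library: the word boxes, the `rect` and `popupContents` keys, the function name `wordsInRect` and the sample values are all assumptions, and the words are presumed to arrive in reading order.

```php
<?php
// Boxes are [llx, lly, urx, ury] in PDF user space.
// How the words and their boxes are obtained depends on the library used.
function wordsInRect(array $words, array $rect): string
{
    [$llx, $lly, $urx, $ury] = $rect;
    $hits = [];
    foreach ($words as $word) {
        [$wllx, $wlly, $wurx, $wury] = $word['box'];
        // Use the word's center so partially touching neighbours are ignored.
        $cx = ($wllx + $wurx) / 2;
        $cy = ($wlly + $wury) / 2;
        if ($cx >= $llx && $cx <= $urx && $cy >= $lly && $cy <= $ury) {
            $hits[] = $word['text'];
        }
    }

    return implode(' ', $hits);
}

// Made-up sample data: four words on one line and one markup annotation.
$words = [
    ['text' => 'time',   'box' => [72, 700, 95, 712]],
    ['text' => "you'll", 'box' => [100, 700, 130, 712]],
    ['text' => 'spend',  'box' => [134, 700, 165, 712]],
    ['text' => 'here',   'box' => [169, 700, 190, 712]],
];
$markups = [
    ['rect' => [99, 699, 166, 713], 'popupContents' => 'you will use'],
];

$pairs = [];
foreach ($markups as $markup) {
    $pairs[] = [
        'incorrect' => wordsInRect($words, $markup['rect']),
        'correct'   => $markup['popupContents'],
    ];
}

foreach ($pairs as $pair) {
    echo $pair['incorrect'], ' => ', $pair['correct'], PHP_EOL;
}
```

For annotations that span several lines, the same test would presumably be run once per quad-point rectangle rather than on the overall bounding box.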

JosephA