Parsing PDF and getting the header portion information

Question

I am trying to parse the contents of PDFs, which are scientific research papers.

Here's the portion I am trying to grab:

I only need the paper title and the author name(s).

I used the PDF Parser library, and I was able to get the header text using this code:

function get_pdf_prop( $file )
{
    $parser = new \Smalot\PdfParser\Parser();
    $pdf    = $parser->parseFile( $file );

    $details  = $pdf->getDetails();

    $page = $pdf->getPages()[0];

    //-- Extract the text of the first page
    $text = $page->getText();
    $text = explode( 'ABSTRACT', $text, 2 );    //-- get the text before the "ABSTRACT"
    $text = $text[0];

    //-- split the lines
    $lines = explode( "\n", $text );

    return array(
        'total_pages'   => $details['Pages'],
        'paper_title'   => $lines[0] . $lines[1],
        'author'        => $lines[2]
    );
}

What I do is parse the full text of the first page, which is returned as plain text. Since the required content comes before the word ABSTRACT, I split the text there and then split the result into lines.

I assume the first two lines are the title and the third line is the author name. So far, papers like the one shown in the screenshot above give correct results.

But problems occur in the following scenarios:

If the paper title is a single line, I don't know that beforehand, so my code will always return the first two lines as the paper title. This might put both the title and the author name into paper_title.
If the paper title is three lines, the third title line ends up in author.
If there is more than one author, my code will not return the proper data.

So, any suggestions on how I can effectively grab data like the paper title and author name(s) from a scientific paper in PDF form? I am sure they all follow the same pattern, since the PDFs are created with LaTeX tools. Any better solutions or clues?

Kindly note that I am trying to do this on papers uploaded to my site, and I am using PHP as the server-side language.

Thank you

Is there always a blank line between the title and the author details? Looking for that would allow you to deal with varying numbers of lines for the title. — droopsnoot, Jul 11 '19 at 11:19
@droopsnoot Looking at the code I think the blank line is not returned by `$page->getText();`. It would have been nice. — KIKO Software, Jul 11 '19 at 11:23
I think that this problem cannot be solved when using `$page->getText();`, which returns plain text. — KIKO Software, Jul 11 '19 at 11:30
Have you tried retrieving the document metadata? There is a sample code block in the [PDF Parser documentation](https://www.pdfparser.org/documentation). — lovelace, Jul 11 '19 at 11:41
@KIKOSoftware, yeah it returns plain text. And because of that, I had to do this splitting and guessing method to grab the info! — Akhilesh B Chandran, Jul 11 '19 at 14:05
@lovelace, yes, I already tried that, and the metadata was not there in the PDFs that I tested. — Akhilesh B Chandran, Jul 11 '19 at 14:06
Would it help if you knew the font family, size or position? You may check out our commercial PDF/PHP tool [SetaPDF-Extractor](https://www.setasign.com/extractor) for such a task. — Jan Slabon, Jul 11 '19 at 17:40

score 0 · Answer 1 · answered Jul 11 '19 at 13:07

You could try using the PDF metadata to retrieve the fields you need (author, title, and so on). I tried a few scientific papers, picked at random, and they all have (at least) metadata for pages, author and title.

PDF Parser docs show how this can be done:

<?php

// Include Composer autoloader if not already done.
include 'vendor/autoload.php';

// Parse pdf file and build necessary objects.
$parser = new \Smalot\PdfParser\Parser();
$pdf    = $parser->parseFile('document.pdf');

// Retrieve all details from the pdf file.
$details  = $pdf->getDetails();

// Loop over each property to extract values (string or array).
foreach ($details as $property => $value) {
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo $property . ' => ' . $value . "\n";
}

?>

Sample output for a randomly picked paper (var_dump($details)):

array(7) {
  ["Author"]=>
  string(18) "Chris Fraley et al"
  ["CreationDate"]=>
  string(25) "2011-06-23T19:20:24+01:00"
  ["Creator"]=>
  string(26) "pdftk 1.41 - www.pdftk.com"
  ["ModDate"]=>
  string(25) "2019-07-11T14:56:29+02:00"
  ["Producer"]=>
  string(45) "itext-paulo-155 (itextpdf.sf.net-lowagie.com)"
  ["Title"]=>
  string(38) "Probabilistic Weather Forecasting in R"
  ["Pages"]=>
  int(9)
}
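
Since some PDFs carry these fields but leave them blank, it may be worth checking each value before trusting it. This is only a minimal sketch, assuming `$details` is the array returned by `getDetails()`. The helper name `pick_metadata_field` is hypothetical and not part of PDF Parser.

```php
<?php

/**
 * Return a trimmed metadata value, or null when the key is missing or blank.
 * Hypothetical helper: not part of the PDF Parser library.
 */
function pick_metadata_field(array $details, string $key): ?string
{
    if (!isset($details[$key])) {
        return null;
    }

    $value = $details[$key];
    if (is_array($value)) {
        // Some properties come back as arrays; flatten them like the loop above.
        $value = implode(', ', $value);
    }

    $value = trim((string) $value);

    return $value === '' ? null : $value;
}

// Hand-written data shaped like the dump above, with a blank Author field.
$details = [
    'Title'  => 'Probabilistic Weather Forecasting in R',
    'Author' => '  ',
];

$title  = pick_metadata_field($details, 'Title');   // usable title string
$author = pick_metadata_field($details, 'Author');  // null: fall back to text parsing
```

A null result is the signal to fall back to parsing the page text for that field.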

Thank you. But this works only if the PDF file has this metadata, and none of the papers I tried have it filled in; the fields are empty. FYI, I was downloading and testing papers from this website: http://ceur-ws.org — Akhilesh B Chandran, Jul 11 '19 at 14:03
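
When the metadata is empty, font size may be a better cue than line counts, along the lines of the font/position suggestion in the comments. I believe newer PDF Parser releases offer `$page->getDataTm()`, and a `Config` option `setDataTmFontInfoHasToBeIncluded(true)` that adds the font id and font size to each fragment, but please verify both against your installed version. The sketch below does not call the library. It works on plain arrays of the assumed shape `[matrix, text, fontId, fontSize]`, with x at `matrix[4]` and y at `matrix[5]`. The function name `split_title_and_authors` and the `0.5` tolerance are hypothetical choices. It treats the first run of equal-sized text as the title and the next run as the author block, so the number of title lines no longer matters.

```php
<?php

/**
 * Split first-page fragments into a title and an author block by font size.
 * Assumed fragment shape: [matrix, text, fontId, fontSize], with x at
 * $fragment[0][4] and y at $fragment[0][5]. Verify this shape before use.
 */
function split_title_and_authors(array $fragments, float $tolerance = 0.5): array
{
    // PDF y coordinates grow upward: sort top to bottom, then left to right.
    usort($fragments, fn($a, $b) => [$b[0][5], $a[0][4]] <=> [$a[0][5], $b[0][4]]);

    // Collect consecutive fragments that share a font size into runs.
    $runs = [];
    foreach ($fragments as $fragment) {
        $text = trim($fragment[1]);
        if ($text === '') {
            continue;
        }
        $last = count($runs) - 1;
        if ($last >= 0 && abs($runs[$last]['size'] - $fragment[3]) <= $tolerance) {
            $runs[$last]['texts'][] = $text;
        } else {
            $runs[] = ['size' => (float) $fragment[3], 'texts' => [$text]];
        }
    }

    return [
        'paper_title' => isset($runs[0]) ? implode(' ', $runs[0]['texts']) : null,
        'author'      => isset($runs[1]) ? implode(' ', $runs[1]['texts']) : null,
    ];
}

// Hand-written sample data; real input would come from the first page.
$fragments = [
    [[1, 0, 0, 1, 120.0, 700.0], 'Parsing Scientific', 'F1', 17.2],
    [[1, 0, 0, 1, 150.0, 680.0], 'Papers in PHP', 'F1', 17.2],
    [[1, 0, 0, 1, 130.0, 650.0], 'A. Author, B. Author', 'F2', 12.0],
    [[1, 0, 0, 1, 72.0, 600.0], 'ABSTRACT', 'F3', 10.0],
];

$result = split_title_and_authors($fragments);
// $result['paper_title'] is "Parsing Scientific Papers in PHP"
// $result['author'] is "A. Author, B. Author"
```

This is a heuristic, not a guarantee: it would misfire if the authors are set in the same size as the title, or if affiliations share the authors' font size, so results are worth a sanity check.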
