How to extract the abstract of a webpage?

Question

I am writing code to extract the abstract from an arXiv page, for example http://arxiv.org/abs/1207.0102. I am interested in extracting the text from "We study a model of..." to "...compass-Heisenberg model." My code currently looks like this:

$url="http://arxiv.org/abs/1207.0102";
$options = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
  )
);
$context = stream_context_create($options);
$str = file_get_contents($url, false, $context);

if (preg_match('~<body[^>]*>(.*?)</body>~si', $str, $body))
{
    echo $body[1];
}

The problem with this is that it extracts everything in the body tag. Is there a way to extract the abstract only?

preg_match('~ (.*?) ~si', $str, $body) - should be sufficient in this case, but everyone will say: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 :) - sinisake, Aug 15 '15 at 21:42

Brandon Max · Answer 1 · 2015-08-15T21:48:38.857

The best option would be to use a DOM parser. PHP has one built in (http://php.net/manual/en/class.domdocument.php), but there are also many third-party classes that do something similar.

Using DOMDocument, you would do something like this:

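<?php
  $doc = new DOMDocument();
  // Sample markup; in practice, pass the fetched page ($str) instead.
  $doc->loadHTML('<html><body><p id="abstract">Test</p></body></html>');
  // getElementById() returns null when no element has that id.
  $node = $doc->getElementById("abstract");
  $text = $node !== null ? trim($node->textContent) : null;
  echo $text;
?>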
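
For a page where the abstract has no id, the same DOM approach can select by tag and class instead. The following is only a sketch under an assumption this thread does not confirm: that the abstract sits in a <blockquote> whose class list contains "abstract". The function name extractAbstract and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the abstract text, or null if no matching element is found.
// Assumption: the abstract is wrapped in <blockquote class="abstract ...">.
function extractAbstract(string $html): ?string
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match a blockquote whose space-separated class list contains the token "abstract".
    $nodes = $xpath->query(
        '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Collapse runs of whitespace so the text reads as one paragraph.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

// Stand-in markup for illustration; in the question's script, pass $str instead.
$sample = '<html><body><h1>Title</h1>'
        . '<blockquote class="abstract mathjax">We study a model of ... compass-Heisenberg model.</blockquote>'
        . '</body></html>';

echo extractAbstract($sample), "\n";
```

If the selector does not match the real page, inspect the page source and adjust the XPath expression accordingly.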

The other option is to use regex, which seems to be what you're already doing. As you can tell, it is a bit messier and requires some learning; see http://www.regular-expressions.info/tutorial.html
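
A regex version might look like the sketch below. It rests on the same unconfirmed assumption that the abstract is wrapped in a <blockquote> whose class attribute begins with "abstract"; extractAbstractRegex is a hypothetical name, and the pattern will break if the markup changes or if blockquotes are nested.

```php
<?php
// Hypothetical helper; assumes the abstract is wrapped in <blockquote class="abstract ...">.
function extractAbstractRegex(string $html): ?string
{
    // Non-greedy match from an opening blockquote whose class attribute starts
    // with the token "abstract" up to the first closing </blockquote>.
    $pattern = '~<blockquote[^>]*\bclass="abstract(?:\s[^"]*)?"[^>]*>(.*?)</blockquote>~si';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }

    // Drop nested tags, decode HTML entities, and collapse whitespace.
    $text = html_entity_decode(strip_tags($m[1]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Stand-in markup for illustration; in the question's script, pass $str instead.
$html = '<body><p>Header</p><blockquote class="abstract mathjax">Example abstract text.</blockquote></body>';

echo extractAbstractRegex($html), "\n";
```

Compared with the body-matching pattern in the question, the only real change is anchoring the match on the element that holds the abstract rather than on <body>.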

Thanks.
