I am writing code to extract the abstract from an arXiv page. For example, for the page http://arxiv.org/abs/1207.0102, I want to extract the text from "We study a model of..." to "...compass-Heisenberg model." My code currently looks like this:

    $url = "http://arxiv.org/abs/1207.0102";
    $options = array(
        'http' => array(
            'method' => "GET",
            'header' => "User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
        )
    );
    $context = stream_context_create($options);
    $str = file_get_contents($url, false, $context);
    if (preg_match('~<body[^>]*>(.*?)</body>~si', $str, $body))
    {
        echo $body[1];
    }

The problem is that this extracts everything inside the body tag. Is there a way to extract only the abstract?
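
One direction that might work is to parse the HTML with PHP's DOM extension instead of a regex over the whole body. The sketch below assumes the abstract is wrapped in a `<blockquote>` whose class list contains `abstract` and that it begins with an "Abstract:" label; that markup is an assumption about the page and should be checked against the actual source. `extractAbstract` is a hypothetical helper name, and `$sample` is stand-in markup used in place of the fetched `$str` so the snippet runs without network access.

```php
<?php
// Hypothetical helper: returns the abstract text, or null if no matching element is found.
function extractAbstract(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world HTML is rarely strictly well-formed.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: the abstract lives in <blockquote class="abstract ...">.
    $nodes = $xpath->query(
        '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $text = $nodes->item(0)->textContent;
    // Drop a leading "Abstract:" label (assumed), then collapse runs of whitespace.
    $text = preg_replace('~^\s*Abstract:\s*~i', '', $text);
    return trim(preg_replace('~\s+~', ' ', $text));
}

// Stand-in markup for the fetched page; the abstract text is elided as in the question.
$sample = '<html><body><h1>Title</h1>'
    . '<blockquote class="abstract mathjax"><span class="descriptor">Abstract:</span> '
    . 'We study a model of ... compass-Heisenberg model.</blockquote>'
    . '</body></html>';

echo extractAbstract($sample), "\n";
```

With the real page, the call would presumably be `extractAbstract($str)`. If the class name differs, only the XPath expression needs to change.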