
I am writing code to extract the abstract from an arXiv page. For example, on the page http://arxiv.org/abs/1207.0102 I want to extract the text from "We study a model of..." to "...compass-Heisenberg model." My code currently looks like this:

$url="http://arxiv.org/abs/1207.0102";
$options = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko\r\n"
  )
);
$context = stream_context_create($options);
$str = file_get_contents($url, false, $context);

if (preg_match('~<body[^>]*>(.*?)</body>~si', $str, $body))
{
    echo $body[1];
}

The problem is that this extracts everything inside the body tag. Is there a way to extract only the abstract?

user3741635
  • preg_match('~ (.*?) ~si', $str, $body) - should be sufficient in this case, but everyone will say: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 :)
    – sinisake Aug 15 '15 at 21:42

1 Answer

The best option is to use a DOM parser. PHP has one built in (http://php.net/manual/en/class.domdocument.php), and there are also many third-party classes that do something similar.

Using DOMDocument, you would do something like this:

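<?php
  $doc = new DOMDocument();
  // Sample markup for illustration; pass the fetched page ($str) here instead.
  $doc->loadHTML('<html><body><div id="abstract">Test</div></body></html>');
  // getElementById() returns null when no element has that id.
  $node = $doc->getElementById("abstract");
  $text = $node ? trim($node->textContent) : null;
?>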
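
If the abstract element has no usable id, selecting it by class with DOMXPath is one alternative. The following is only a sketch: it assumes the abstract sits in a <blockquote> whose class list contains "abstract", which is an assumption about the arXiv markup that should be checked against the page source. The function name extractAbstract and the sample HTML are made up for illustration.

```php
<?php
// Hypothetical helper: returns the abstract text, or null if no matching element is found.
function extractAbstract(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumption: the abstract is a <blockquote> with "abstract" among its classes.
    $nodes = $xpath->query(
        '//blockquote[contains(concat(" ", normalize-space(@class), " "), " abstract ")]'
    );
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Collapse runs of whitespace so the result is a single line of text.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

// Stand-in markup for illustration; in practice pass the $str fetched in the question.
$sample = '<html><body><h1>Title</h1>'
        . '<blockquote class="abstract mathjax">We study a model of ... compass-Heisenberg model.</blockquote>'
        . '</body></html>';
echo extractAbstract($sample), "\n";
```

Returning null rather than an empty string lets the caller tell "no abstract element found" apart from "abstract element was empty".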

The other option is to use a regex, which is what you are already doing. As you can tell, it is messier and takes some learning; see http://www.regular-expressions.info/tutorial.html
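
For a single, known page layout, a narrowly targeted pattern can be enough. Here is a rough sketch of the regex route, again assuming the abstract is wrapped in a <blockquote> whose class attribute contains "abstract" (an assumption about the markup, not something confirmed here). The function name extractAbstractRegex and the sample HTML are hypothetical.

```php
<?php
// Hypothetical helper: regex-based extraction; returns null when the pattern does not match.
function extractAbstractRegex(string $html): ?string
{
    // Assumption: the abstract is wrapped in <blockquote class="abstract ...">...</blockquote>.
    // "~" delimiters avoid escaping "/", "s" lets "." match newlines, "i" ignores case.
    $pattern = '~<blockquote[^>]*class="[^"]*\babstract\b[^"]*"[^>]*>(.*?)</blockquote>~si';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // Drop any inner tags, decode entities, and normalize whitespace.
    $text = html_entity_decode(strip_tags($m[1]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Stand-in markup for illustration; in practice pass the $str fetched in the question.
$sampleHtml = '<body><blockquote class="abstract mathjax">'
            . '<span class="descriptor">Abstract:</span> We study a model &amp; more.'
            . '</blockquote></body>';
echo extractAbstractRegex($sampleHtml), "\n";
```

A pattern like this breaks as soon as the attribute order, quoting, or nesting changes, which is the main reason the DOM approach is usually preferred.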

Thanks.