Parsing curl results as simpleXML and using those to create new XML data

Question

I'm pulling data from PubMed as XML with curl, then loading those results into another page as SimpleXML. This lets me grab the information I need (a list of pub IDs) and use it as a variable for ANOTHER PubMed request, which gets the summaries for those specific pub IDs. Here's my first file ($name will eventually be dynamic):

<?php 
header('Content-type: text/xml');
$name = 'white,theodore';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term='.$name.'[author]&retmode=xml&retmax=50');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);

$output = curl_exec($ch);

print $output;

curl_close($ch);

?>

This outputs XML data that includes (among other things) a list of pub IDs.

xml output

<eSearchResult>
<Count>45</Count>
<RetMax>45</RetMax>
<RetStart>0</RetStart>
  <IdList>
  <Id>27431223</Id>
  <Id>26234644</Id>
  <Id>25824209</Id>
  <Id>25667269</Id>
  <Id>25646566</Id>
  <Id>25085959</Id>
  <Id>24453983</Id>
  <Id>23908482</Id>
  <Id>23845238</Id>
  <Id>23758576</Id>
  <Id>23606207</Id>
  <Id>23475705</Id>
  <Id>23253612</Id>
  <Id>22951933</Id>
  <Id>22479177</Id>
  <Id>22080454</Id>
  <Id>21977036</Id>
  <Id>21951709</Id>
  <Id>21247460</Id>
  <Id>21145410</Id>
  <Id>21078937</Id>
  <Id>20941354</Id>
  <Id>20737430</Id>
  <Id>20656915</Id>
  <Id>20430817</Id>
  <Id>20161440</Id>
  <Id>19880755</Id>
  <Id>18757808</Id>
  <Id>18675371</Id>
  <Id>18539886</Id>
  <Id>18436555</Id>
  <Id>18404551</Id>
  <Id>18343803</Id>
  <Id>18310042</Id>
  <Id>17951521</Id>
  <Id>17071565</Id>
  <Id>15980350</Id>
  <Id>15766602</Id>
  <Id>15590814</Id>
  <Id>15047513</Id>
  <Id>14653518</Id>
  <Id>12576598</Id>
  <Id>12517831</Id>
  <Id>12019079</Id>
  <Id>11932451</Id>
</IdList>
<TranslationSet>
<Translation>
  <From>white, theodore[author]</From>
  <To>White, Theodore[Full Author Name]</To>
</Translation>
</TranslationSet>
<TranslationStack>
<TermSet>
  <Term>White, Theodore[Full Author Name]</Term>
  <Field>Full Author Name</Field>
  <Count>45</Count>
  <Explode>N</Explode>
</TermSet>
<OP>GROUP</OP>
</TranslationStack>
<QueryTranslation>White, Theodore[Full Author Name] </QueryTranslation>
</eSearchResult>

I then load that into another page so I can use SimpleXML to convert the pub IDs into a variable. Using that variable, I attempt another curl/PubMed request, which pulls summaries based on those IDs:

<?php
$xml=simplexml_load_file('https://sbs2.umkc.edu/wp-content/themes/SBS_Theme/js/pubMedExport.php','SimpleXMLElement', LIBXML_NOCDATA) or die("Error: Cannot create object");
$idList = $xml->IdList;
foreach($idList->children() as $id) {
$idResult = $id . ",";
//echo $idResult;

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id='.$id.'');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);

$result = curl_exec($ch);
echo $result . "</br></br>";

curl_close($ch);

}
?>

I can get this to output individual citations, but my problem is that I still need to get hold of that second set of data so that I can format certain things like authors and exclude irrelevant data.

full citations

Here's the XML from ONE result.

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary v1 20041029//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20041029/esummary-v1.dtd">
<eSummaryResult>
  <DocSum>
    <Id>27431223</Id>
    <Item Name="PubDate" Type="Date">2016 Oct</Item>
    <Item Name="EPubDate" Type="Date">2016 Sep 23</Item>
    <Item Name="Source" Type="String">Antimicrob Agents Chemother</Item>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Bhattacharya S</Item>
      <Item Name="Author" Type="String">Sobel JD</Item>
      <Item Name="Author" Type="String">White TC</Item>
    </Item>
    <Item Name="LastAuthor" Type="String">White TC</Item>
    <Item Name="Title" Type="String">A Combination Fluorescence Assay Demonstrates Increased Efflux Pump Activity as a Resistance Mechanism in Azole-Resistant Vaginal Candida albicans Isolates.</Item>
    <Item Name="Volume" Type="String">60</Item>
    <Item Name="Issue" Type="String">10</Item>
    <Item Name="Pages" Type="String">5858-66</Item>
    <Item Name="LangList" Type="List">
    <Item Name="Lang" Type="String">English</Item>
    </Item>
    <Item Name="NlmUniqueID" Type="String">0315061</Item>
    <Item Name="ISSN" Type="String">0066-4804</Item>
    <Item Name="ESSN" Type="String">1098-6596</Item>
    <Item Name="PubTypeList" Type="List">
      <Item Name="PubType" Type="String">Journal Article</Item>
    </Item>
    <Item Name="RecordStatus" Type="String">Unknown status</Item>
    <Item Name="PubStatus" Type="String">epublish</Item>
    <Item Name="ArticleIds" Type="List">
      <Item Name="pubmed" Type="String">27431223</Item>
      <Item Name="pii" Type="String">AAC.01252-16</Item>
      <Item Name="doi" Type="String">10.1128/AAC.01252-16</Item>
      <Item Name="pmc" Type="String">PMC5038269</Item>
      <Item Name="rid" Type="String">27431223</Item>
      <Item Name="eid" Type="String">27431223</Item>
      <Item Name="pmcid" Type="String">pmc-id: PMC5038269;embargo-date: 2017/04/01;</Item>
    </Item>
    <Item Name="DOI" Type="String">10.1128/AAC.01252-16</Item>
    <Item Name="History" Type="List">
      <Item Name="received" Type="Date">2016/06/10 00:00</Item>
      <Item Name="accepted" Type="Date">2016/07/12 00:00</Item>
      <Item Name="pmc-release" Type="Date">2017/04/01 00:00</Item>
      <Item Name="entrez" Type="Date">2016/07/20 06:00</Item>
      <Item Name="pubmed" Type="Date">2016/07/20 06:00</Item>
      <Item Name="medline" Type="Date">2016/07/20 06:00</Item>
    </Item>
    <Item Name="References" Type="List"></Item>
    <Item Name="HasAbstract" Type="Integer">1</Item>
    <Item Name="PmcRefCount" Type="Integer">0</Item>
    <Item Name="FullJournalName" Type="String">Antimicrobial agents and chemotherapy</Item>
    <Item Name="ELocationID" Type="String">doi: 10.1128/AAC.01252-16</Item>
    <Item Name="SO" Type="String">2016 Oct;60(10):5858-66</Item>
</DocSum>

</eSummaryResult>
</br></br>

I can't figure out how to grab the items in that second set of data. The source reveals it's still formatted properly but I keep getting "Trying to get property of non-object" errors.
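
For reference, here is a minimal sketch of reading that second set of data. It assumes the response body is held in a string, the way curl_exec returns it; the trimmed XML below is copied from the result above rather than fetched live, and $citations is a name made up for this example. Two things commonly produce the "non-object" notice in this situation: the curl result is a plain string until it is parsed, and the parsed object already is the root eSummaryResult element, so access starts at DocSum rather than at ->eSummaryResult.

```php
<?php
// Trimmed copy of the eSummary XML shown above, standing in for a live curl_exec() result.
$result = <<<'XML'
<?xml version="1.0" encoding="UTF-8" ?>
<eSummaryResult>
  <DocSum>
    <Id>27431223</Id>
    <Item Name="PubDate" Type="Date">2016 Oct</Item>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Bhattacharya S</Item>
      <Item Name="Author" Type="String">Sobel JD</Item>
      <Item Name="Author" Type="String">White TC</Item>
    </Item>
    <Item Name="Volume" Type="String">60</Item>
  </DocSum>
</eSummaryResult>
XML;

// Parse the string; simplexml_load_string() returns false on malformed XML.
$cxml = simplexml_load_string($result);
if ($cxml === false) {
    exit('Could not parse the eSummary response');
}

$citations = []; // hypothetical name: one formatted line per DocSum
foreach ($cxml->DocSum as $docsum) {
    // Items are told apart by their Name attribute, so select them with XPath.
    $authors = [];
    foreach ($docsum->xpath('Item[@Name="AuthorList"]/Item[@Name="Author"]') as $author) {
        $authors[] = (string) $author;
    }
    $pubDate = (string) $docsum->xpath('Item[@Name="PubDate"]')[0];
    $citations[] = implode(', ', $authors) . '. ' . $pubDate . '.';
}

echo implode("\n", $citations), "\n";
```

Casting each node to a string before storing it keeps plain text in the array instead of SimpleXMLElement objects, which makes later formatting and comparisons more predictable.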

I considered sending these results to yet another file and using SimpleXML to control it, but because I'm parsing the first file and adding another curl on the same page, it doesn't seem to like it when I add the header.

Any help would be greatly appreciated!

UPDATE: Thanks to @EatPeanutButter for pointing me in the right direction. By using $cxml=simplexml_load_string($result); instead of $Cxml = new SimpleXMLElement($result); I was not only able to grab the data I needed, but also combine the curls onto a single page as follows.

<?php 
$name = 'white,theodore';
// Return xml data from PubMed based on author search name
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term='.$name.'[author]&retmode=xml&retmax=50');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch, CURLOPT_FRESH_CONNECT, 1);

$output = curl_exec($ch);

curl_close($ch);

// Parse the results and concatenate into a string of Publication IDs
$xml=simplexml_load_string($output);
$idList = $xml->IdList;
$ids = "";
foreach($idList->children() as $id) {
    $ids .= $id . ",";
}

// Plug that string of IDs into another PubMed search, this one returning XML data for Publication Summaries
$path = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id='.$ids;

$ch2 = curl_init();
curl_setopt($ch2, CURLOPT_URL, $path);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch2, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch2, CURLOPT_VERBOSE, 0);
curl_setopt($ch2, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch2, CURLOPT_AUTOREFERER, true);
curl_setopt($ch2, CURLOPT_MAXREDIRS, 10);
curl_setopt($ch2, CURLOPT_CUSTOMREQUEST, 'POST');
curl_setopt($ch2, CURLOPT_FRESH_CONNECT, 1);

$result = curl_exec($ch2);

curl_close($ch2);
// Parse those results and print only what is needed for Citation format
$cxml=simplexml_load_string($result);
foreach ($cxml->children() as $docsum) {
    foreach ($docsum->children() as $item) {
        // Authors sit one level down, inside the AuthorList item
        foreach ($item->children() as $details) {
            if ((string) $details['Name'] === 'Author') { echo $details . "., "; }
        }
        if ((string) $item['Name'] === 'FullJournalName') { echo $item . ". "; }
        if ((string) $item['Name'] === 'Title') { echo "<strong>" . $item . "</strong> "; }
        if ((string) $item['Name'] === 'Volume') { echo "Vol." . $item . ", "; }
        if ((string) $item['Name'] === 'Issue') { echo "Issue" . $item . ". "; }
        if ((string) $item['Name'] === 'PubDate') { echo $item . ". "; }
        // Publication types are nested the same way, inside PubTypeList
        foreach ($item->children() as $details) {
            if ((string) $details['Name'] === 'PubType') { echo $details . ", "; }
        }
    }
    // <br> is the valid line-break tag; </br> only works through browser error recovery
    echo "<br><br>";
}

?>

And now, of course, this has created a new issue, which I'm going to post as a follow-up question!
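
One fragile spot in the combined script above is that both URLs are built by plain string concatenation, so the comma and square brackets in the search term go out unencoded and the ID list ends with a trailing comma. Below is a sketch of one way to build the same two URLs with http_build_query() and implode(). It assumes a shortened copy of the eSearch output in place of a live response, and $base, $searchUrl and $summaryUrl are names invented for this example. The encoded commas in the id parameter should be decoded server-side like any other query value, though that was not tested against PubMed here.

```php
<?php
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$name = 'white,theodore';

// http_build_query() percent-encodes the comma and brackets in the term.
$searchUrl = $base . 'esearch.fcgi?' . http_build_query([
    'db'      => 'pubmed',
    'term'    => $name . '[author]',
    'retmode' => 'xml',
    'retmax'  => 50,
], '', '&');

// Shortened eSearch response (first three IDs only) standing in for curl_exec() output.
$output = '<eSearchResult><IdList>'
    . '<Id>27431223</Id><Id>26234644</Id><Id>25824209</Id>'
    . '</IdList></eSearchResult>';

$xml = simplexml_load_string($output);

// Collect the IDs into an array, then join them so there is no trailing comma.
$ids = [];
foreach ($xml->IdList->Id as $id) {
    $ids[] = (string) $id;
}

$summaryUrl = $base . 'esummary.fcgi?' . http_build_query([
    'db' => 'pubmed',
    'id' => implode(',', $ids),
], '', '&');

echo $searchUrl, "\n", $summaryUrl, "\n";
```

Either URL can then be passed to CURLOPT_URL exactly as $path is in the script above.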

I'm a little confused - are you wanting to format the data in the 2nd screenshot (which I think is what's being returned from the cURL requests in the foreach loop)? Does that just come back as plain text, or as XML? — WillardSolutions, Dec 05 '16 at 22:04
Yes, I need to be able to break apart those citations so I can add and style only what I need. It echoes as the blocks you see in the second screen shot but the source shows the XML structure. I couldn't add a third screenshot (no rep) so I've added it to my dropbox here: https://www.dropbox.com/s/64echgckquybbu0/Screen%20Shot%202016-12-05%20at%203.53.26%20PM.png?dl=0 - When I try to grab them with $results->esummaryresult->etc, I get the "trying to get property of non-object" error. — cebronix, Dec 05 '16 at 22:09
There is no `$results` in your code. Can you add the XML to the question so we can have a reproducible example? — chris85, Dec 05 '16 at 22:15
The error is because you are trying to access the XML object without parsing it as XML first. Try adding `$xml = new SimpleXMLElement($result);` just before your echo, and then try to read the XML as you were doing before. See this (1st answer) for a nice function you can use: http://stackoverflow.com/questions/561816/php-curl-extract-an-xml-response — WillardSolutions, Dec 05 '16 at 22:17
You added 2 XML files. Which file are you having issues parsing? You also could be violating the PM terms with this code, isn't it 1 request per 3 seconds? — chris85, Dec 05 '16 at 22:37
@EatPeanutButter Parsing it as XML returns a blank page, including the source. `$Cxml = new SimpleXMLElement($result); echo $Cxml;` Wrapping it in a function like the link suggested didn't seem to help. Same results, except it no longer recognized my $id variable, so I plugged some IDs in manually to test and still got a blank page. One issue I did find was that I was using $result for two different variables. I fixed that by renaming the top one $idResult, but it didn't help. — cebronix, Dec 07 '16 at 15:48
@chris85, it's the second set I'm having trouble parsing. The second set takes data from the first set for its query. When you asked about the PM terms, did you mean that might be why, if I use the string of IDs as my query, it only returns the last result? I don't think that's the issue, because if I manually plug in a string of multiple IDs, it loads all of them. — cebronix, Dec 07 '16 at 15:50
I think I may have gotten it! I used `$cxml=simplexml_load_string($result);` instead of `$Cxml = new SimpleXMLElement($result);` and now I'm able to grab a hold of individual elements! I still can't figure out why if I use the $idResult in my 2nd query instead of $id it only returns the last item. The string echoes fine. This forces me to include the curl inside the loop. Whereas if I paste in the exported $idResult, it works fine. — cebronix, Dec 07 '16 at 19:57
And in case anyone is still reading this, I figured out the $idResult issue. Turns out I wasn't concatenating the string at all. Only adding a comma. I had to add that one sticking period. `foreach($idList->children() as $id) { $idResult .= $id . ","; }` — cebronix, Dec 09 '16 at 15:11
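
The difference described in the last comment can be reproduced without any network calls. Here is a small sketch using three of the IDs from the eSearch output above; $overwritten and $appended are illustrative names, not variables from the original script:

```php
<?php
$ids = ['27431223', '26234644', '25824209'];

// Plain assignment replaces the string on every pass, so only the last ID survives.
$overwritten = '';
foreach ($ids as $id) {
    $overwritten = $id . ',';
}

// The concatenating assignment (.=) appends to what is already there.
$appended = '';
foreach ($ids as $id) {
    $appended .= $id . ',';
}

echo $overwritten, "\n"; // 25824209,
echo $appended, "\n";    // 27431223,26234644,25824209,
```

This matches the symptom in the thread: with plain assignment, a request built after the loop only ever sees the last ID.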
