We are indexing our journals with PHP, and we have journal metadata files. I am trying to parse one with PHP's SimpleXML, but I am getting lots of errors.

Warning: simplexml_load_string(): Entity: line 19: parser error : Opening and ending tag mismatch: XUI line 19 and BB in *** on line 62

Warning: simplexml_load_string(): s;S PERSPECTIVE

Warning: simplexml_load_string(): ^ in *** on line 62

Warning: simplexml_load_string(): Entity: line 44: parser error : Opening and ending tag mismatch: BB line 4 and D in *** on line 62

Warning: simplexml_load_string(): 33rd ed. St. Louis, MO: Elsevier Health Sciences; 2016.

Warning: simplexml_load_string(): ^ in *** on line 62

Warning: simplexml_load_string(): Entity: line 61: parser error : Opening and ending tag mismatch: XUI line 61 and BB in *** on line 62

Warning: simplexml_load_string(): R TO THE EDITOR

Warning: simplexml_load_string(): ^ in *** on line 62

Warning: simplexml_load_string(): Entity: line 74: parser error : Opening and ending tag mismatch: BB line 46 and D in *** on line 62

When I looked at the file, it seemed to be an XML file. How can I parse it with PHP?

The code I am using is:

$file = file_get_contents('xyz.0');

$file = utf8_decode($file);
$file = str_replace("&", "", $file); //For problems with & character

//libxml_use_internal_errors(true);
$xml = simplexml_load_string($file, 'SimpleXMLElement', LIBXML_NOCDATA);

Sample XML Code from file:

<!DOCTYPE dg SYSTEM "ovidbase.dtd"> <DG><COVER NAME="G1893697-201804000-00000"> <D AN="01893697-201804000-00001" V="2009.2F" FILE="G1893697-201804000-00001"> <BB> <TG> <TI>Oh Blood Pressure Measurements&mdash;Where Art Thou&quest;</TI></TG> <BY> <PN><FN>G.</FN><MN>Stephen</MN><SN>Morris</SN><DEG>PT, PhD, FACSM</DEG></PN> <AF><P>President, Oncology Section of the APTA; and Professor, Department of Physical Therapy, Wingate University, Wingate, NC</P></AF> <BT><P><E T="B">Correspondence:</E> G. Stephen Morris, PT, PhD, FACSM, Department of Physical Therapy, Wingate University, 215 N. Camden Rd, Wingate, NC 28174 (<URL>s.morris&commat;wingate.edu</URL>).</P><P>The author declares no conflicts of interest.</P></BT></BY> <SO> <PB>Rehabilitation Oncology</PB> <ISN>2168-3808</ISN> <DA><MO>April</MO><YR>2018</YR></DA> <V>36</V> <IS><IP>2</IP></IS> <PG>79&ndash;80</PG></SO> <CP>&copy; 2018 Oncology Section, APTA.</CP> <DT>PRESIDENT&apos;S PERSPECTIVE</DT><XUI XDB="pub-doi" UI="10.1097/01.REO.0000000000000118"></BB> <BD> <LV1><HD>&NA;</HD> <P>physical therapy&quest;</P></LV1> <LV1><SG><SGN>G. Stephen Morris, PT, PhD, FACSM</SGN></SG></LV1></BD> <ED> <EDS><HD>REFERENCES</HD> <RF ID="R1-1">1. <JRF><DRF>Arena SK, Reyes A, Rolf M. Behaviors, and knowledge of outpatient physical therapists. Cardiopulm Phys Ther J. 2018;9:3&ndash;12.</DRF><PN><FN>SK</FN><SN>Arena</SN></PN><PN><FN>A</FN><SN>Reyes</SN></PN><PN><FN>M</FN><SN>Rolf</SN></PN><TI>Behaviors, and knowledge of outpatient physical therapists</TI><PB>Cardiopulm Phys Ther J</PB><DA><YR>2018</YR></DA><V>9</V><PG>3&ndash;12</PG></JRF></RF> <RF ID="R2-1">2. <URF>US Preventative Services Task Force. High blood pressure in adults: screening. https:&sol;&sol;www.uspreventiveservicestaskforce.org&sol;Page&sol;Document&sol;RecommendationStatementFinal&sol;high-blood-pressure-in-adults-screening. Accessed January 12, 2018.</URF></RF> <RF ID="R3-1">3. <URF>Centers for Disease Control and Prevention. High blood pressure fact sheet. 
https:&sol;&sol;www.cdc.gov&sol;bloodpressure&sol;facts.htm. Accessed January 12, 2018.</URF></RF> <RF ID="R4-1">4. <JRF><DRF>Lein DH Jr, Clark D, Graham C, Perez P, Morris D. A model to integrate health promotion and wellness in physical therapist practice: development and validation. Phys Ther. 2017;97(12):1169&ndash;1181.</DRF><PN><FN>DH</FN><SN>Lein</SN></PN><PN><FN>D</FN><SN>Clark</SN></PN><PN><FN>C</FN><SN>Graham</SN></PN><PN><FN>P</FN><SN>Perez</SN></PN><PN><FN>D</FN><SN>Morris</SN></PN><TI>A model to integrate health promotion and wellness in physical therapist practice: development and validation</TI><PB>Phys Ther</PB><DA><YR>2017</YR></DA><V>97</V><PG>1169&ndash;1181</PG></JRF></RF> <RF ID="R5-1">5. <URF>Riebe D, ed. ACSM&apos;s Guidelines for Exercise Testing and Prescription. 10th ed. Baltimore, Maryland: Wolters Kluwer; 2018.</URF></RF> <RF ID="R6-1">6. <JRF><DRF>Pickering TG, Hall JE, Appel LJ, et al Recommendations for blood pressure measurement in humans and experimental animals: part 1: blood pressure measurement in humans: a statement for professionals from the Subcommittee of Professional and Public Education of the American Heart Association Council on High Blood Pressure Research. Circulation. 2005;111(5):697&ndash;716.</DRF><PN><FN>TG</FN><SN>Pickering</SN></PN><PN><FN>JE</FN><SN>Hall</SN></PN><PN><FN>LJ</FN><SN>Appel</SN></PN><TI>Recommendations for blood pressure measurement in humans and experimental animals: part 1: blood pressure measurement in humans: a statement for professionals from the Subcommittee of Professional and Public Education of the American Heart Association Council on High Blood Pressure Research</TI><PB>Circulation</PB><DA><YR>2005</YR></DA><V>111</V><PG>697&ndash;716</PG></JRF></RF> <RF ID="R7-1">7. <JRF><DRF>Rabbia F, Testa E, Rabbia S, et al Effectiveness of blood pressure educational and evaluation program for the improvement of measurement accuracy among nurses. High Blood Press Cardiovasc Prev. 
2013;20(2):77&ndash;80.</DRF><PN><FN>F</FN><SN>Rabbia</SN></PN><PN><FN>E</FN><SN>Testa</SN></PN><PN><FN>S</FN><SN>Rabbia</SN></PN><TI>Effectiveness of blood pressure educational and evaluation program for the improvement of measurement accuracy among nurses</TI><PB>High Blood Press Cardiovasc Prev</PB><DA><YR>2013</YR></DA><V>20</V><PG>77&ndash;80</PG></JRF></RF> <RF ID="R8-1">8. <JRF><DRF>Frese EM, Richter RR, Burlis TV. Self-reported measurement of heart rate and blood pressure in patients by physical therapy clinical instructors. Phys Ther. 2002;82(12):1192&ndash;1200.</DRF><PN><FN>EM</FN><SN>Frese</SN></PN><PN><FN>RR</FN><SN>Richter</SN></PN><PN><FN>TV</FN><SN>Burlis</SN></PN><TI>Self-reported measurement of heart rate and blood pressure in patients by physical therapy clinical instructors</TI><PB>Phys Ther</PB><DA><YR>2002</YR></DA><V>82</V><PG>1192&ndash;1200</PG></JRF></RF> <RF ID="R9-1">9. <JRF><DRF>Mouhavar E, Salahudeen A, Yeh ETH. Hypertension in cancer patients. Tex Heart Inst J. 2011;38(3):263&ndash;265.</DRF><PN><FN>E</FN><SN>Mouhavar</SN></PN><PN><FN>A</FN><SN>Salahudeen</SN></PN><PN><FN>ETH</FN><SN>Yeh</SN></PN><TI>Hypertension in cancer patients</TI><PB>Tex Heart Inst J</PB><DA><YR>2011</YR></DA><V>38</V><PG>263&ndash;265</PG></JRF></RF> <RF ID="R10-1">10. <URF>Gahart BL, Nazareno AR, eds. Intravenous Medications: A Handbook for Nurses and Health Professionals. 33rd ed. St. Louis, MO: Elsevier Health Sciences;
2016.</URF></RF></EDS></ED></D> <D AN="01893697-201804000-00002" V="2009.2F" FILE="G1893697-201804000-00002"> <BB> <TG> <TI>In 2018 &ldquo;Spring Is the Time of Plans and Projects&rdquo;</TI></TG> <BY> <PN><FN>Lucinda</FN><MN>(Cindy)</MN><SN>Pfalzer</SN><DEG>PT, PhD, FACSM, FAPTA</DEG></PN> <AF><P>Editor of <E T="I">Oncology Rehabilitation</E> and Emeriti Professor, Physical Therapy Department, University of Michigan-Flint, Flint, MI</P></AF> <BT><P><E T="B">Correspondence:</E> Lucinda (Cindy) Pfalzer, PT, PhD, FACSM, FAPTA, Physical Therapy Department, University of Michigan-Flint, 2157 WSW Bldg, Flint, MI 48502 (<URL>cpfalzer&commat;umich.edu</URL>).</P><P>The author declares no conflicts of interest.</P></BT></BY> <SO> <PB>Rehabilitation Oncology</PB> <ISN>2168-3808</ISN> <DA><MO>April</MO><YR>2018</YR></DA> <V>36</V> <IS><IP>2</IP></IS> <PG>81&ndash;82</PG></SO> <CP>&copy; 2018 Oncology Section, APTA.</CP> <DT>LETTER TO THE EDITOR</DT><XUI XDB="pub-doi" UI="10.1097/01.REO.0000000000000119"></BB> <BD>

You can download the xml file from here.

Thank you

EDIT: This is different from the question "XML parser error: entity not defined". These files were generated years ago (in the 2000s, etc.). I am not generating these files; I am only trying to parse them and extract the metadata.

EDIT 2: Sorry, I was also trying to parse the file with the DOM parser and pasted its errors when I created the post. I have now replaced them with the SimpleXML errors.

Ben Perry
  • That's NOT an XML file. I think that's a SAP-specific tag – RiggsFolly Nov 16 '18 at 13:29
  • I'm not sure how you're getting errors about `DOMDocument::loadXML` when you say you aren't calling that method – iainn Nov 16 '18 at 13:32
  • Possible duplicate of [XML parser error: entity not defined](https://stackoverflow.com/questions/3805050/xml-parser-error-entity-not-defined) – Mohammad Nov 16 '18 at 13:32
  • @RiggsFolly do you have any idea how to parse this file? – Ben Perry Nov 16 '18 at 13:52
  • Short of looking for a library to help, no – RiggsFolly Nov 16 '18 at 13:55
  • @iainn Thank you for the information, i've changed the error messages. – Ben Perry Nov 16 '18 at 14:09
  • I don't know why anyone would download a random file from a shady file-sharing website. Post a representative sample of the data as part of your question. – miken32 Nov 16 '18 at 23:12
  • Please could you [edit] the question to include a small (but complete) portion of the file *as text*, so we don't have to rely on links to an external site (with a rather dubious history). Ideally, questions like this should contain a [mcve] - enough information that without any external resource, a reader could reproduce the error you're reporting, and test their suggested fixes. – IMSoP Nov 19 '18 at 14:10
  • Just in case anyone else wants to look into this - the document has `<!DOCTYPE dg SYSTEM "ovidbase.dtd">` as the first line. This may point to some appropriate libraries. – Nigel Ren Nov 19 '18 at 15:02

1 Answer

The file doesn't stick to the XML spec: it contains unknown entities and also unclosed tags.

Removing the & characters makes the parser ignore the entities. To solve some of the other problems, I used regular expressions to tidy the tags up (I'm not a regex expert, but each replacement takes an unclosed tag such as <COVER ...> and converts it to <COVER ... />)...

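$file = file_get_contents('20180400.xml');

// Hack: drop every "&" so undefined entities such as &mdash; no longer break the parser
// (the entity name is left behind as plain text, e.g. "mdash;").
$file = str_replace("&", "", $file);
// Turn the unclosed empty tags into self-closing ones, keeping each tag's own name.
$file = preg_replace('/<COVER (.*?)>/', '<COVER $1 />', $file);
$file = preg_replace('/<XUI (.*?)>/', '<XUI $1 />', $file);
$file = preg_replace('/<TGP (.*?)>/', '<TGP $1 />', $file);

// libxml_use_internal_errors(true);
$xml = simplexml_load_string($file, 'SimpleXMLElement', LIBXML_NOCDATA);
$xml->asXML("out.xml"); // writes the cleaned-up document to out.xml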
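If losing the entities matters (for example the dash in a title or the @ in an e-mail address), one possible variation is to map the known entities to real characters instead of deleting the &. This is only a rough sketch: the function name `sgmlToXmlish`, its `$emptyTags` parameter and the entity list are my own assumptions, the list covers only the entities visible in the sample, and unknown entities such as `&NA;` are simply kept as literal text.

```php
<?php
// Hypothetical helper (not part of the code above): makes the Ovid-style SGML
// well-formed enough for SimpleXML without throwing the entities away.
function sgmlToXmlish(string $sgml, array $emptyTags): string
{
    // Entities seen in the sample, mapped to the characters they name.
    // Assumption: a real file needs this list extended from its DTD.
    $entities = [
        '&mdash;'  => "\u{2014}",
        '&ndash;'  => "\u{2013}",
        '&ldquo;'  => "\u{201C}",
        '&rdquo;'  => "\u{201D}",
        '&copy;'   => "\u{00A9}",
        '&quest;'  => '?',
        '&commat;' => '@',
        '&sol;'    => '/',
    ];
    $out = strtr($sgml, $entities);

    // Escape every remaining "&" that does not start one of XML's five
    // predefined entities, so unknown ones (e.g. &NA;) survive as literal text.
    $out = preg_replace('/&(?!(?:amp|lt|gt|quot|apos);)/', '&amp;', $out);

    // Self-close the elements that never get an end tag in the source.
    foreach ($emptyTags as $tag) {
        $pattern = '/<' . preg_quote($tag, '/') . '\b([^>]*?)\s*\/?>/';
        $out = preg_replace($pattern, '<' . $tag . '${1}/>', $out);
    }
    return $out;
}

// Cut-down fragment of the sample from the question.
$sample = '<DG><COVER NAME="G1893697-201804000-00000"><D><BB>'
        . '<TI>Oh Blood Pressure Measurements&mdash;Where Art Thou&quest;</TI>'
        . '<XUI XDB="pub-doi" UI="10.1097/01.REO.0000000000000118"></BB>'
        . '<BD><HD>&NA;</HD></BD></D></DG>';

$xml = simplexml_load_string(sgmlToXmlish($sample, ['COVER', 'XUI', 'TGP']));
if ($xml === false) {
    throw new RuntimeException('Still not well-formed');
}
echo $xml->D->BB->TI, "\n";         // title with the dash and question mark restored
echo $xml->D->BB->XUI['UI'], "\n";  // the DOI attribute
```

The list of empty tags is passed in rather than hard-coded because other files may contain more of them. It is still a hack on top of a non-XML format, so the output needs checking against the source files.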
Nigel Ren
  • I would be very hesitant about ad hoc fixes like this; from other comments, it sounds like the file may be in some non-XML format, where the actual meaning of these tags and entities might be relevant. Just stripping them out might lead to fragile code and incorrect results on other files. – IMSoP Nov 19 '18 at 14:10
  • @IMSoP, as with any answers on SO, it is up to OP to check that the code and any processing is up to what they need. If this is for some business purpose then I would assume there is some form of testing and validation in the project which again is something they must assume responsibility for. – Nigel Ren Nov 19 '18 at 14:17
  • Indeed, but some fixes are riskier than others, and I thought it worth calling out that this is on the "hack that will probably work but might cause problems later" end of the spectrum rather than the "well-recognised technique that you'll find in plenty of professional codebases" end of the spectrum. – IMSoP Nov 19 '18 at 14:31
  • @IMSoP, but if it was a choice of a *hack* or ditch all of the data and start again. With appropriate validation and oversight I would rather go with a hack - which is much less error prone than starting again. – Nigel Ren Nov 19 '18 at 14:38
  • I'm not disagreeing with posting this answer; I'm just saying that a warning that this is a hack might be sensible, in case readers get the impression that this is a good solution any time they have errors. Also, I don't think the alternative is "ditch the data and start again"; I think the alternative is "research what format the data is in and how its creator intended it to be parsed". Unless the data has been corrupted (in which case it's dangerous anyway), it's presumably *something other than XML*, and may be documented somewhere. – IMSoP Nov 19 '18 at 14:50
  • Thank you for the answer. I am doing it like this. I know this is not the correct solution, but it solved my issue for now. – Ben Perry Nov 19 '18 at 15:00
  • @IMSoP the data is in SGML format and I couldn't find a way to parse it. So I tried to change it to an XML-like format, as in your solution. – Ben Perry Nov 19 '18 at 15:06