I am getting an XML document like this:

<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; font-wei.... huge list of styling
<items>
    <item>
    ...

That second line seems to be preventing me from parsing the file.

Using Tidy

<?php
$config = array(
       'indent'     => true,
       'input-xml'  => true,
       'output-xml' => true,
       'wrap'       => false);
$tidy = new tidy;
$tidy->parseFile('https://website.com/path/to/XML.ashx?param=12345', $config);
$tidy->cleanRepair();

print_r($tidy);
?>

which will result in:

tidy Object
(
    [errorBuffer] => 
    [value] => <?xml version="1.0" encoding="utf-8"?>

)

Using simplexml_load_file()

<?php
$xml = simplexml_load_file('https://website.com/path/to/XML.ashx?param=12345');
print_r($xml);
?>

output:

**Warning**: simplexml_load_file(): https://website.com/path/to/XML.ashx?param=12345:1: parser error : Start tag expected, '<' not found in **C:\xampp\htdocs\local\php\script.php** on line 2

**Warning**: simplexml_load_file(): <?xml version="1.0" encoding="utf-8" ?> in **C:\xampp\htdocs\local\php\script.php** on line 2

**Warning**: simplexml_load_file(): ^ in **C:\xampp\htdocs\local\php\script.php** on line 2

I've also tried various cURL options and simply file_get_contents()

My question is: What is that second line of XML and how can I parse this file?

sqlab
Greg Rudd
  • That is rather bizarre. Everything in normal XML is wrapped in a tag. I would strip it out because no parser will handle it. – Machavity Jul 29 '14 at 22:38
  • I don't know what that @ stuff is; but XML it's not. – Dan Is Fiddling By Firelight Jul 29 '14 at 22:40
  • But when I enter the URL of the XML document in a browser it is parsed just fine. It'll say "This XML file does not appear to have any style information associated with it. The document tree is shown below." and then displays the XML(?) neatly. When I view source there is no @namespace or styling junk – Greg Rudd Jul 29 '14 at 22:56
  • Did you mean to close the trimmed section on line 2 of your XML with a `}` closing brace? – halfer Jul 29 '14 at 23:00
  • yeah that line continues for 1,216 characters, so i didn't include it all. As far as I know it's all closed up. – Greg Rudd Jul 29 '14 at 23:08
  • Yes, I just checked, all brackets are properly closed – Greg Rudd Jul 29 '14 at 23:14
  • Browsers will make insanely heroic attempts to parse anything with angle brackets no matter how horribly flawed because if they didn't at least 90% of web sites you attempted to visit would just get a broken html error screen. – Dan Is Fiddling By Firelight Jul 30 '14 at 00:18
  • It is not XML it is XHTML and the @namespace html url(http://www.w3.org/1999/xhtml); is declaring the html namespace. Hence your browser understands it but an XML parser won't. – Dijkgraaf Jul 30 '14 at 02:11
  • @Dan Neely: That's only for HTML. This is XML; it's clear based on the output given by the OP's comment that the browser is parsing it as XML, and XML parsers **never** make such corrections of any kind to XML that is not well-formed - they will just abort with an error immediately. In this case, the XML is not well-formed, so it *should* cause an error. – BoltClock Jul 30 '14 at 12:03

2 Answers

XML does not allow non-whitespace text nodes after the XML declaration. So what you have is invalid XML, and this is what the libraries are telling you. But Tidy (release 25 March 2009) can deal with that:

$buffer = '<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; font-wei.... huge list of styling
<items>
    <item></item> </items>';

$config = array(
    'indent'     => true,
    'input-xml'  => true,
    'output-xml' => true,
    'wrap'       => false);
$tidy = new tidy;
$tidy->parseString($buffer, $config);
$tidy->cleanRepair();

print_r($tidy);

Output:

tidy Object
(
    [errorBuffer] => line 2 column 1 - Warning: discarding unexpected plain text
    [value] => <?xml version="1.0" encoding="utf-8"?>
<items>
  <item></item>
</items>
)

So you most likely have more issues with that "XML" (or it's a limitation of a buffer if you have a very very large line there).
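
If Tidy is not available, the stray text can also be stripped by hand before parsing, as one of the comments suggests. The following is only a minimal sketch: the helper name `stripTextAfterDeclaration` is made up for illustration, the sample input is shortened, and it assumes the junk between the XML declaration and the root element never contains a `<` character.

```php
<?php
// Hypothetical helper (not part of any library): drop whatever sits between
// the XML declaration and the first "<" that follows it.
// Assumption: the stray text itself contains no "<".
function stripTextAfterDeclaration(string $raw): string
{
    // $1 is the declaration; [^<]* swallows the non-XML text after it.
    return preg_replace('/\A(<\?xml[^>]*\?>)[^<]*/', '$1' . "\n", $raw, 1);
}

// Shortened stand-in for the real response body.
$raw = '<?xml version="1.0" encoding="UTF-8"?>
@namespace html url(http://www.w3.org/1999/xhtml); :root { font:small Verdana; }
<items>
    <item>first</item>
</items>';

$xml = simplexml_load_string(stripTextAfterDeclaration($raw));

echo $xml->getName(), "\n";    // root element name
echo count($xml->item), "\n";  // number of <item> children
```

This only helps when the body really does contain the XML after the stray line. If the parser still fails afterwards, the response is probably incomplete in some other way.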

As this isn't XML, you might ask yourself what it is. It's CSS, and what you have there is a so-called at-rule, more concretely a CSS namespace declaration. (Browsers (user agents) per the earlier CSS specs did not have to support any of these. Even the current Selectors API requires any namespace prefix that needs resolving to cause an exception within the API. A good example of CSS namespace usage with an XML (XHTML) document is in this earlier answer.)

What follows in your chunk of text is a namespace prefix and the CSS under it.

So what you have there is a mixture of different data. It won't parse as valid XML, and you won't find any common browser that can actually deal with that CSS either, even if it would validate, because it's not clear that the text is CSS (it would need to be wrapped inside an element denoting a stylesheet).

Side note: A correct CSS parser would drop the XML here as it's invalid, and the CSS specs state that anything invalid needs to be dropped. So what you have there, taken as a whole, could technically conform as a CSS document. You think it's XML, it's just CSS ;)

So as bizarre as this @ rule might sound to you, it actually isn't. It exists, just not at such a place.

On the other hand it's not really helpful to cover up the source as website.com - seeing the real site might have given more context to tell you more.

hakre
  • `@namespace` declarations and namespaced CSS selectors have excellent browser support. It just has no meaning here because it isn't appearing in a situation where a browser would reasonably try to parse it as CSS. – BoltClock Jul 30 '14 at 11:59
  • In earlier versions browsers didn't need to support it. In current versions the CSS selector APIs do not need to support namespace declarations, that is to change the prefix to query it then. I don't know any browser which allows declaring namespaces by prefixes for the selector API. I would call that excellent browser support, but you probably know better. – hakre Jul 30 '14 at 12:09
  • Also, the reason why the document you link to says it's a draft is because you're linking to the *editor's draft* - which is basically the trunk in the document's revision system. CSS namespaces were made a W3C Recommendation [years ago](http://www.w3.org/TR/css3-namespace). – BoltClock Jul 30 '14 at 12:10
  • You got me wrong again, the CSS specs wrote that namespaces must not be supported by the user-agent. Not that namespaces weren't specified. And I had that link of yours earlier, will fix. – hakre Jul 30 '14 at 12:11
  • The reason the Selectors API doesn't allow you to declare namespaces is because the Selectors API is not CSS, whereas `@namespace` is part of CSS syntax. If you wrote an XHTML document with namespace declarations and selectors in a CSS stylesheet, you'll find that any browser including IE9+ will apply the styles just fine. – BoltClock Jul 30 '14 at 12:12
  • But the Selectors API uses CSS Selectors which - as they are CSS - also contain CSS namespaces (via the prefix), so this is still missing. As prefixes are aliases and interchangeable because only the namespace URI counts - not the prefix. - Or I'm wrong and CSS Selectors aren't part of the CSS syntax. – hakre Jul 30 '14 at 12:15
  • That's where it gets understandably confusing :) The "CSS" in "CSS selectors" is an artifact from when selectors were indeed originally developed as part of the CSS language, but today selectors have been repurposed for uses outside of CSS. The current [Selectors spec](http://www.w3.org/TR/selectors) makes a distinction between CSS syntax and selector syntax, which you can see in the abstract. (It's still maintained by the CSSWG however since it was originally a CSS language feature.) – BoltClock Jul 30 '14 at 12:19
  • So while an implementation may use CSS selectors, that does not necessarily make it a CSS engine, just a selector engine. – BoltClock Jul 30 '14 at 12:21
  • Well, I now found something which was not clear to me: *"These groups of selectors should not use namespace prefixes that need to be resolved"* as this *should not* is RFC it means that prefixes should not used with the selectors API at current state. This resolves that the reason of dissent disappears as no API implementation should actually support that. Instead it *must* even raise a syntax error exception when finding a namespace that needs resolution. So this is all moot. - http://www.w3.org/TR/selectors-api/#grammar + following – hakre Jul 30 '14 at 12:38
  • I apologize for obscuring the website name and I appreciate your detail despite the lack of context. I did this firstly because the website concerns my personal finances, and secondly because it requires a login to access anyway. I bring this up because I actually think this is the crux of the problem: I am not passing a cookie to `website.com` when using Tidy to validate an authenticated session. I think this is why Tidy returns no error and just the document type declaration in my first example, because `website.com` will send the DTD but no XML data to an unauthenticated request. – Greg Rudd Jul 30 '14 at 19:07
  • `simplexml_load_file()` returns `Start tag expected, '<' not found` because it expects XML data no matter what, and since I was not authenticated `website.com` did not send me XML data so simplexml fired errors – Greg Rudd Jul 30 '14 at 19:09
  • SimpleXMLElement gives errors because the string is not valid XML. I understand the authentication issue and that you don't want to share credentials. Can you request the XML with PHP, e.g. with `file_get_contents` for example? If so you can get the XML into a string so you can decouple fetching the XML from parsing it. That also can help with trouble-shooting. – hakre Jul 30 '14 at 19:13
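
The decoupling suggested in the last comment could look roughly like the sketch below. Both helper names (`fetchBody`, `parseXml`) are hypothetical, and the cookie string is a placeholder because the real session cookie of `website.com` is not known here. The idea is to fetch the body as a plain string first, look at what the server actually sent, and only then hand it to the parser while collecting the libxml errors instead of emitting warnings.

```php
<?php
// Hypothetical helper: fetch the raw response body and send a session cookie.
// The cookie string is an assumption; use whatever the site sets after login.
function fetchBody(string $url, string $cookie): string
{
    $context = stream_context_create([
        'http' => ['header' => "Cookie: $cookie\r\n"],
    ]);
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    return $body;
}

// Hypothetical helper: parse a string and return [document or null, error messages].
function parseXml(string $body): array
{
    $previous = libxml_use_internal_errors(true); // collect errors instead of warnings
    $xml = simplexml_load_string($body);
    $errors = array_map(
        function ($error) {
            return sprintf('line %d: %s', $error->line, trim($error->message));
        },
        libxml_get_errors()
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$xml === false ? null : $xml, $errors];
}

// With the real endpoint the first step would be something like:
// $body = fetchBody('https://website.com/path/to/XML.ashx?param=12345', 'name=value');
// var_dump($body); // inspect what was actually sent before parsing

// A body that holds only the XML declaration, like the Tidy output in the question:
[$xml, $errors] = parseXml("<?xml version=\"1.0\" encoding=\"utf-8\" ?>\n");
var_dump($xml === null); // true when parsing failed
print_r($errors);        // the collected parser messages
```

Keeping the fetch and the parse apart makes it easier to tell an authentication problem (an empty or partial body) from a well-formedness problem in the data itself.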

It is not XML it is XHTML (Extensible HyperText Markup Language) and the below is declaring the html namespace, followed by CSS styling. Hence your browser understands it but an XML parser won't.

@namespace html url(http://www.w3.org/1999/xhtml);

It is HTML meant to be compliant with XML; however, it looks like this page may not be compliant with strict XHTML and hence does not parse as XML.

Dijkgraaf
  • Except a browser has no reason to assume that a line beginning with a `@namespace` token is CSS unless it's specifically within a CSS context. – BoltClock Jul 30 '14 at 11:52
  • And a document with a root element called `<items>` is most certainly *not* an XHTML document. – BoltClock Jul 30 '14 at 11:53