How to parse XML from a URL

Question

In general, I want to parse XML from a URL. This is what I have done so far:

I wrote the XML in an HTML file, enclosed in a <textarea></textarea> tag:

    <textarea rows="1000" cols="200" style="border:none;">
    <?xml version="1.0"?>
    <data>
      <gambar>
        <id>wcl01</id>
        <url>https://1.bp.blogspot.com/- j9yARC6mAuY/Xp4aUTxe6eI/AAAAAAAAAGA/NegvRkwYdVAXhnTsrWoXYcjAzsHfR6BOQCLcBGAsYHQ/s320/Konferensi%2BIIWAS%2Bdi%2BVietnam.jpg</url>
      </gambar>
      <gambar>
        <id>wcl02</id>
        <url>https://1.bp.blogspot.com/-aIkYkd3ePMY/XqDDsTMYMAI/AAAAAAAAAHA/QKZOQ8cPr_0LUfLNrYrA3w6gvNV-ao-QCLcBGAsYHQ/s320/Konferensi%2BAptikom%2Bdi%2BBandung%2B1.jpg</url>
      </gambar>
    </data>
    </textarea>

On the website, this is how it looks:

Then I parse the xml using this code:

    from urllib.request import urlopen
    from xml.etree.ElementTree import parse
    from lxml import etree
    var_url = urlopen('https://imanparyudi.000webhostapp.com/gambar.html')
    xmldoc = parse(var_url)
    elem = etree.XML(xmldoc, parser=parser)

but I got this error:

    File "<string>", line unknown ParseError: XML or text declaration not at start of entity: line 2, column 0

I assume that this error is caused by whitespace at the beginning of the XML. So I tried to remove this whitespace, first with etree.XMLParser(remove_blank_text=True) and then with etree.XMLParser(recover=True), like this:

    from urllib.request import urlopen
    from xml.etree.ElementTree import parse
    from lxml import etree
    parser = etree.XMLParser(remove_blank_text=True)
    var_url = urlopen('https://imanparyudi.000webhostapp.com/gambar.html')
    xmldoc = parse(var_url)
    elem = etree.XML(xmldoc, parser=parser)

and

    from urllib.request import urlopen
    from xml.etree.ElementTree import parse
    from lxml import etree
    parser = etree.XMLParser(recover=True)
    var_url = urlopen('https://imanparyudi.000webhostapp.com/gambar.html')
    xmldoc = parse(var_url)
    elem = etree.XML(xmldoc, parser=parser)

But both approaches give the same error:

    File "<string>", line unknown ParseError: XML or text declaration not at start of entity: line 2, column 0

So, my questions here are:

a. Is this problem caused by the use of the <textarea></textarea> tag?

b. If so, how can I post my XML on a website?

c. If not, how can I solve this ParseError?

Why are you not using `parse` from `lxml.etree`? Don't mix APIs. `lxml` extends Python's ElementTree API so shares many methods with `etree`. Also, please post the output of `var_url` not as screenshot. Let's see exact object you intend to `parse`. — Parfait, Feb 14 '21 at 00:53

score 1 · Accepted Answer · answered Feb 14 '21 at 10:55


You get back an HTML document.
Inside the HTML there is a <textarea> that holds the XML document.
The code below locates the XML document and parses it.

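    import requests
    import xml.etree.ElementTree as ET

    r = requests.get('https://imanparyudi.000webhostapp.com/gambar.html')
    if r.status_code == 200:  # 200 means the server returned the page successfully
        # The XML starts at its declaration and ends just before </textarea>.
        start = r.text.find('<?xml')
        end = r.text.find('</textarea>')
        # find() returns -1 when a marker is missing, so check before slicing.
        if -1 < start < end:
            root = ET.fromstring(r.text[start:end])
            print(root)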
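
Once `root` is available, the individual `<gambar>` entries can be read with ElementTree's `findall` and `findtext`. The following is a minimal sketch, assuming the XML has the structure shown in the question; the function name `extract_images` and the shortened sample URLs are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the XML in the question (URLs are placeholders).
SAMPLE_XML = """<?xml version="1.0"?>
<data>
  <gambar>
    <id>wcl01</id>
    <url>https://example.com/first.jpg</url>
  </gambar>
  <gambar>
    <id>wcl02</id>
    <url>https://example.com/second.jpg</url>
  </gambar>
</data>"""


def extract_images(root):
    """Return a list of (id, url) tuples, one per <gambar> element."""
    images = []
    for gambar in root.findall('gambar'):
        # findtext returns None when the child element is absent.
        images.append((gambar.findtext('id'), gambar.findtext('url')))
    return images


root = ET.fromstring(SAMPLE_XML)
for image_id, url in extract_images(root):
    print(image_id, url)
```

The same function should work unchanged on the `root` produced by the snippet above, since it only relies on the `<data>`/`<gambar>` layout.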


balderman


Hi @balderman, it works. Thank you very much. But I have one question. What is this: r.status_code == 200 ? – Iman Feb 15 '21 at 23:42
The status code is the HTTP status code returned by the HTTP server. 200 means OK – balderman Feb 16 '21 at 06:04
Another question. In my case, I posted my XML in HTML using the <textarea> tag. Is it possible to post XML in HTML without using that tag? If so, how can I do that? – Iman Feb 17 '21 at 00:32
Just upload the xml to the website and point the browser to it – balderman Feb 17 '21 at 06:26
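
If the XML is uploaded as its own file, as suggested in the last comment, it can be parsed straight from the response with no slicing. This is a rough sketch using only the standard library; the helper name `parse_xml_from_url` and the `timeout` default are assumptions made for the example, not part of the original answer.

```python
from urllib.request import urlopen
import xml.etree.ElementTree as ET


def parse_xml_from_url(url, timeout=10):
    """Fetch `url` and return the root element of the XML it serves."""
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses,
    # so no explicit status-code check is needed here.
    with urlopen(url, timeout=timeout) as response:
        # ET.parse accepts any file-like object with a read() method.
        return ET.parse(response).getroot()
```

A call would then look like `root = parse_xml_from_url('https://example.com/gambar.xml')`, where the address is a placeholder for wherever the file is actually hosted.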

score 0 · Answer 2 · answered Feb 14 '21 at 00:15


It's probably because you're missing this at the beginning of the document


Lazt Omen


score 0 · Answer 3 · answered Feb 14 '21 at 00:45

Cause

An XML declaration,

    <?xml version="1.0"?>

may appear only once, and only at the very top of an XML document.

Clearly, having

    <textarea rows="1000" cols="200" style="border:none;">

ahead of it violates that requirement.
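
The cause can be reproduced offline. The sketch below feeds two strings to Python's standard `xml.etree.ElementTree`: one with the declaration behind a `<textarea>` tag, and one with the declaration first. The sample strings and the helper name `parse_error` are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page: the declaration sits on line 2,
# after the opening <textarea> tag.
WRAPPED = (
    '<textarea rows="1000" cols="200" style="border:none;">\n'
    '<?xml version="1.0"?>\n'
    '<data><gambar><id>wcl01</id></gambar></data>\n'
    '</textarea>'
)

# The same XML with the declaration at the very start.
BARE = '<?xml version="1.0"?>\n<data><gambar><id>wcl01</id></gambar></data>'


def parse_error(text):
    """Return the ParseError message for `text`, or None if it parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


print(parse_error(WRAPPED))  # an error message about the misplaced declaration
print(parse_error(BARE))     # None
```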

Remedies

1. Since you're only specifying that the XML is version 1.0, and that's the default anyway, simply remove the XML declaration, or
2. Remove everything before the XML declaration.
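
Both remedies can be sketched on an in-memory string. This assumes the page consists of nothing but the `<textarea>` wrapper around the XML; a real HTML page with other markup may not be well-formed XML, in which case only the second approach is likely to work. The function names are hypothetical, and the second function also trims the trailing `</textarea>`, which would otherwise follow the root element and break parsing.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page body.
WRAPPED = (
    '<textarea rows="1000" cols="200">\n'
    '<?xml version="1.0"?>\n'
    '<data><gambar><id>wcl01</id></gambar></data>\n'
    '</textarea>'
)


def parse_without_declaration(text):
    """Remedy 1: drop the XML declaration, so <textarea> becomes the root."""
    cleaned = re.sub(r'<\?xml[^>]*\?>', '', text, count=1)
    return ET.fromstring(cleaned)


def parse_from_declaration(text):
    """Remedy 2: discard everything before the declaration.

    The closing </textarea> is cut off as well (an addition for this sketch).
    """
    start = text.find('<?xml')
    end = text.rfind('</textarea>')
    if start == -1 or end < start:
        raise ValueError('no wrapped XML document found')
    return ET.fromstring(text[start:end])


print(parse_without_declaration(WRAPPED).tag)  # textarea
print(parse_from_declaration(WRAPPED).tag)     # data
```

Note the difference in the resulting root: with the first remedy the `<data>` element is a child of `<textarea>`, while with the second it is the root itself.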
