
In general, I want to parse XML from a URL. This is what I have done so far:

  1. I wrote the XML in an HTML file, enclosed in a <textarea></textarea> tag:

    <textarea rows="1000" cols="200" style="border:none;">
    <?xml version="1.0"?>
    <data>
      <gambar>
        <id>wcl01</id>
        <url>https://1.bp.blogspot.com/- j9yARC6mAuY/Xp4aUTxe6eI/AAAAAAAAAGA/NegvRkwYdVAXhnTsrWoXYcjAzsHfR6BOQCLcBGAsYHQ/s320/Konferensi%2BIIWAS%2Bdi%2BVietnam.jpg</url>
      </gambar>
      <gambar>
        <id>wcl02</id>
        <url>https://1.bp.blogspot.com/-aIkYkd3ePMY/XqDDsTMYMAI/AAAAAAAAAHA/QKZOQ8cPr_0LUfLNrYrA3w6gvNV-ao-QCLcBGAsYHQ/s320/Konferensi%2BAptikom%2Bdi%2BBandung%2B1.jpg</url>
      </gambar>
    </data>
    </textarea>
    

On the website, this is how it looks:

(screenshot of the page as rendered in the browser)

  2. Then I parse the XML using this code:

    from urllib.request import urlopen
    from xml.etree.ElementTree import parse
    from lxml import etree
    var_url = urlopen('https://imanparyudi.000webhostapp.com/gambar.html')
    xmldoc = parse(var_url)
    elem = etree.XML(xmldoc, parser=parser)
    

but I got this error:

    File "<string>", line unknown ParseError: XML or text declaration not at start of entity: line 2, column 0

I assume the error is caused by whitespace at the beginning of the XML. So I tried to remove it, first with etree.XMLParser(remove_blank_text=True) and then with etree.XMLParser(recover=True), like this:

    from urllib.request import urlopen
    from xml.etree.ElementTree import parse
    from lxml import etree
    parser = etree.XMLParser(remove_blank_text=True)
    var_url = urlopen('https://imanparyudi.000webhostapp.com/gambar.html')
    xmldoc = parse(var_url)
    elem = etree.XML(xmldoc, parser=parser)

and

    from urllib.request import urlopen
    from xml.etree.ElementTree import parse
    from lxml import etree
    parser = etree.XMLParser(recover=True)
    var_url = urlopen('https://imanparyudi.000webhostapp.com/gambar.html')
    xmldoc = parse(var_url)
    elem = etree.XML(xmldoc, parser=parser)

But both attempts give the same error:

    File "<string>", line unknown ParseError: XML or text declaration not at start of entity: line 2, column 0

  3. So, my questions are:

a. Is this problem caused by the use of the <textarea></textarea> tag?

b. If so, how can I post my XML on a website?

c. If not, how can I solve this ParseError?

Iman
  • Why are you not using `parse` from `lxml.etree`? Don't mix APIs. `lxml` extends Python's ElementTree API so shares many methods with `etree`. Also, please post the output of `var_url` not as screenshot. Let's see exact object you intend to `parse`. – Parfait Feb 14 '21 at 00:53

3 Answers


The URL returns an HTML document.
Inside the HTML there is a <textarea> that holds the XML document.
The code below locates the XML inside the page and parses it.

    import requests
    import xml.etree.ElementTree as ET

    # Fetch the HTML page that wraps the XML in a <textarea>.
    r = requests.get('https://imanparyudi.000webhostapp.com/gambar.html')
    if r.status_code == 200:
        # Slice out the XML: from the declaration up to the closing tag.
        start = r.text.find('<?xml')
        end = r.text.find('</textarea>')
        root = ET.fromstring(r.text[start:end])
        print(root)

balderman
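
One caveat with the slicing approach: `str.find` returns -1 when the marker is missing, and slicing with -1 silently yields the wrong text. A minimal sketch of a more defensive variant follows; the helper name `extract_xml` and the `page` string (a stand-in for `r.text`) are assumptions for illustration, not part of the answer.

```python
import xml.etree.ElementTree as ET


def extract_xml(html_text):
    """Return the XML embedded in html_text, or raise ValueError."""
    start = html_text.find('<?xml')
    # Look for the closing tag only after the declaration.
    end = html_text.find('</textarea>', start)
    if start == -1 or end == -1:
        raise ValueError('no XML declaration inside a <textarea> found')
    return html_text[start:end]


# Stand-in for r.text; the real page content may differ.
page = (
    '<textarea rows="1000" cols="200">\n'
    '<?xml version="1.0"?>\n'
    '<data><gambar><id>wcl01</id></gambar></data>\n'
    '</textarea>'
)
root = ET.fromstring(extract_xml(page))
ids = [e.text for e in root.iter('id')]
print(root.tag, ids)
```

Raising an explicit error makes a changed page layout fail loudly instead of producing a confusing ParseError later.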
  • Hi @balderman, it works. Thank you very much. But I have one question. What is this: r.status_code == 200 ? – Iman Feb 15 '21 at 23:42
  • The status code is the http status code that is returned from the http server. 200 means OK – balderman Feb 16 '21 at 06:04
  • Another question. In my case, I post my XML in HTML using the <textarea> tag. Is it possible to post XML on a website without wrapping it in a tag? If so, how can I do that? – Iman Feb 17 '21 at 00:32
  • Just upload the xml to the website and point the browser to it – balderman Feb 17 '21 at 06:26
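
If the XML is uploaded as its own file, as the last comment suggests, the response body can be handed straight to the parser with no slicing. The sketch below assumes this; the function name `read_gambar`, the `SAMPLE` bytes and the commented-out URL are hypothetical and do not come from the thread.

```python
import io
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def read_gambar(stream):
    """Parse a <data> document from a file-like object into (id, url) pairs."""
    root = ET.parse(stream).getroot()
    return [(g.findtext('id'), g.findtext('url')) for g in root.iter('gambar')]


# Stand-in for the body of a directly served XML file.
SAMPLE = (
    b'<?xml version="1.0"?>\n'
    b'<data>\n'
    b'  <gambar><id>wcl01</id><url>https://example.com/a.jpg</url></gambar>\n'
    b'  <gambar><id>wcl02</id><url>https://example.com/b.jpg</url></gambar>\n'
    b'</data>\n'
)
pairs = read_gambar(io.BytesIO(SAMPLE))
print(pairs)

# Against a real server (hypothetical URL):
# with urlopen('https://example.com/gambar.xml') as resp:
#     pairs = read_gambar(resp)
```

Because the declaration is now the first thing in the document, the "not at start of entity" error does not arise.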

It's probably because you're missing this at the beginning of the document

Lazt Omen

Cause

An XML declaration,

<?xml version="1.0"?>

may only appear once and only at the very top of an XML document.

Clearly, having

<textarea rows="1000" cols="200" style="border:none;">

ahead of it violates that requirement.
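
The cause can be reproduced without any network access. This is a small sketch using Python's standard `xml.etree.ElementTree`; the `wrapped` string is an assumed, shortened stand-in for the page in the question.

```python
import xml.etree.ElementTree as ET

# The declaration sits on line 2, after the opening <textarea> tag.
wrapped = (
    '<textarea rows="1000" cols="200" style="border:none;">\n'
    '<?xml version="1.0"?>\n'
    '<data><gambar><id>wcl01</id></gambar></data>\n'
    '</textarea>'
)

try:
    ET.fromstring(wrapped)
    message = None
except ET.ParseError as exc:
    # The parser rejects a declaration that is not at the very start.
    message = str(exc)

print(message)
```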

Remedies

  • Since you're only specifying that the XML is version 1.0, and that's the default anyway, simply remove the XML declaration, or
  • Remove everything before the XML declaration.
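
Both remedies can be tried on an in-memory copy of the page. The sketch below is illustrative only: `PAGE` is an assumed, shortened version of the HTML in the question, and in practice the first remedy would mean editing the HTML file itself. For the second remedy the trailing `</textarea>` also has to be cut off, since it would otherwise be stray content after the root element.

```python
import xml.etree.ElementTree as ET

PAGE = '''<textarea rows="1000" cols="200" style="border:none;">
<?xml version="1.0"?>
<data>
  <gambar><id>wcl01</id></gambar>
  <gambar><id>wcl02</id></gambar>
</data>
</textarea>'''

# Remedy 1: drop the declaration; the page then parses as XML
# with <textarea> as the root element.
without_decl = PAGE.replace('<?xml version="1.0"?>', '')
root1 = ET.fromstring(without_decl)
ids1 = [e.text for e in root1.iter('id')]

# Remedy 2: keep only the span from the declaration up to </textarea>,
# so the declaration is at the very top and <data> is the root.
start = PAGE.find('<?xml')
end = PAGE.find('</textarea>')
root2 = ET.fromstring(PAGE[start:end])
ids2 = [e.text for e in root2.iter('id')]

print(root1.tag, ids1)
print(root2.tag, ids2)
```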


kjhughes