Using Python to get a field in XML from URL

Question

I'm trying to get information from a specific field in an XML file at a URL. I'm getting these weird errors before I even get started. Here is my code:

import urllib
from xml.dom import minidom

url1 = 'http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm'
data1 = urllib.urlopen(url1)
xml1 = minidom.parse(data1)

I get this error:

File "C:\Users\Administrator\Desktop\teste.py", line 15, in <module>
    xml1 = minidom.parse(data1)
  File "C:\Python27\lib\xml\dom\minidom.py", line 1920, in parse
    return expatbuilder.parse(file)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 4, column 22

Did I do anything wrong? I copied those functions from a tutorial, and it seems like it should be working.

Seems like the page is not valid XHTML; try using BeautifulSoup. – luke14free, Oct 18 '12 at 15:38
@luke14free Oh, is that a thing? So if the page is not valid for XML parsing, is there another way I can get the information I want? If you open the page, you can see "Verão/2012 " in the top right corner; that's the field I'm looking for. – Laís Minchillo, Oct 18 '12 at 15:41
Try this out: http://validator.w3.org/ Just paste the URL into the address input field. – Alexander Stefanov, Oct 18 '12 at 15:48
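
To illustrate the point made in the comments above, here is a minimal sketch of why a strict XML parser such as minidom rejects ordinary HTML. The SAMPLE_HTML string and the parses_as_xml helper are hypothetical stand-ins, not the actual page content; the markup simply contains the kind of constructs (an unquoted attribute value, an unclosed tag) that HTML tolerates and XML does not.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical stand-in for the page: ordinary HTML, not well-formed XML
# (unquoted attribute value, HTML-only entity, unclosed <br>).
SAMPLE_HTML = "<html><body><p align=center>Ver&atilde;o/2012<br></p></body></html>"


def parses_as_xml(text):
    """Return True if the strict XML parser accepts the text, False otherwise."""
    try:
        minidom.parseString(text)
    except ExpatError:
        # Same exception class as in the traceback from the question.
        return False
    return True


result_html = parses_as_xml(SAMPLE_HTML)
result_xml = parses_as_xml("<root><field>ok</field></root>")
print(result_html, result_xml)
```

If the first check fails while the second succeeds, the problem is the markup rather than the parsing code, which is what the ExpatError in the question suggests.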

score 1 · Answer 1 · answered Oct 18 '12 at 15:45

Use lxml.html; it handles invalid XHTML better.

import lxml.html as lh
# lh.parse accepts a URL and tolerates malformed markup
xml1 = lh.parse('http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm')
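
Once the document is parsed, the field still has to be located. The following is only a sketch under stated assumptions: SAMPLE_HTML is a made-up, deliberately sloppy fragment (not the real page), and the guess that the value sits in a `<font>` element is hypothetical, so the XPath expression will likely need adjusting after inspecting the actual source.

```python
import lxml.html as lh

# Hypothetical, deliberately sloppy markup standing in for the real page:
# unquoted attributes and unclosed <td>/<tr> tags.
SAMPLE_HTML = """
<html><body>
<table><tr>
  <td align=right><font size=2>Ver&atilde;o/2012 </font>
</table>
</body></html>
"""

# fromstring parses markup from a string; lh.parse does the same for a
# file name or URL and returns a tree instead of the root element.
doc = lh.fromstring(SAMPLE_HTML)

# Assumption: the wanted field is the text of the first <font> element.
matches = doc.xpath("//font/text()")
field = matches[0].strip() if matches else None
print(field)
```

With the tree returned by lh.parse(url), the same query can be run as `xml1.getroot().xpath(...)`.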

answered Oct 18 '12 at 15:45

root

@root I heard good things about BeautifulSoup. How do they compare? – cwallenpoole Oct 18 '12 at 16:24
http://stackoverflow.com/questions/1922032/parsing-html-in-python-lxml-or-beautifulsoup-which-of-these-is-better-for-wha – root Oct 18 '12 at 16:31
Some say BS handles badly formed source better, but from personal experience lxml.html does as well. For well-formed source, I would say lxml is superior. lxml is also much faster. – root Oct 18 '12 at 16:38
