I'm trying to read a specific field from an XML file served at a URL, but I get a strange error before I can even start. Here is my code:

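import urllib
from xml.dom import minidom

url1 = 'http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm'
data1 = urllib.urlopen(url1)  # Python 2 API, matching the traceback below
xml1 = minidom.parse(data1)   # strict XML parser: fails on markup that is not well-formed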

I get this error:

File "C:\Users\Administrator\Desktop\teste.py", line 15, in <module>
    xml1 = minidom.parse(data1)
  File "C:\Python27\lib\xml\dom\minidom.py", line 1920, in parse
    return expatbuilder.parse(file)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 928, in parse
    result = builder.parseFile(file)
  File "C:\Python27\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
ExpatError: not well-formed (invalid token): line 4, column 22

Did I do anything wrong? I copied those functions from a tutorial, and it seems like they should work.

bedwyr
Laís Minchillo
  • Seems like the page is not valid XHTML; try using BeautifulSoup. – luke14free Oct 18 '12 at 15:38
  • @luke14free Oh, is that a thing? So if the page is not valid for XML parsing, is there another way I can get the information I want? If you enter the page you can see in the top right corner, "Verão/2012 ", that's the field I'm looking for. – Laís Minchillo Oct 18 '12 at 15:41
  • Try this out: http://validator.w3.org/ Just paste the url in the address input field – Alexander Stefanov Oct 18 '12 at 15:48

1 Answer

Use lxml.html; it handles invalid XHTML better than a strict XML parser such as minidom.

import lxml.html as lh
xml1 = lh.parse('http://www.dac.unicamp.br/sistemas/horarios/grad/G5A0/indiceP.htm')
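
To illustrate the difference, here is a minimal sketch in Python 3 (lxml must be installed separately). The SAMPLE markup is a made-up stand-in for the real page, and the assumption that the term label sits inside a <b> element is hypothetical; inspect the actual page source and adjust the XPath accordingly. The helper names strict_parse_fails and first_bold_text are also just for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

import lxml.html as lh

# Hypothetical sample, NOT the real page: unquoted attribute values and
# unclosed tags make it invalid as XML, though browsers accept it as HTML.
SAMPLE = """<html>
<head><title>Horarios</title></head>
<body>
<table width=100%><tr><td align=right><b>Ver\u00e3o/2012</b>
<br>
</table>
</body>
</html>"""


def strict_parse_fails(markup):
    """Return True if minidom (a strict XML parser) rejects the markup."""
    try:
        minidom.parseString(markup.encode("utf-8"))
    except ExpatError:
        return True
    return False


def first_bold_text(markup):
    """Parse leniently with lxml.html and return the text of the first <b>.

    Returns None when the markup contains no <b> element.
    """
    root = lh.fromstring(markup)
    bold = root.xpath("//b")
    return bold[0].text_content().strip() if bold else None


if __name__ == "__main__":
    print(strict_parse_fails(SAMPLE))
    print(first_bold_text(SAMPLE))
```

With a live URL, lh.parse(url) returns a tree whose getroot() can be queried with the same xpath call.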
root
  • @root I heard good things about BeautifulSoup. How do they compare? – cwallenpoole Oct 18 '12 at 16:24
  • http://stackoverflow.com/questions/1922032/parsing-html-in-python-lxml-or-beautifulsoup-which-of-these-is-better-for-wha – root Oct 18 '12 at 16:31
  • Some say BS handles badly formed source better, but in my experience lxml.html copes just as well. For well-formed source, I would say lxml is superior. lxml is also much faster. – root Oct 18 '12 at 16:38