robust DOM parsing with getElementsByTagName

Question

The following (from "Dive into Python")

from xml.dom import minidom
xmldoc = minidom.parse('/path/to/index.html')
reflist = xmldoc.getElementsByTagName('img')

failed with

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/htmlToNumEmbedded.py", line 2, in <module>
    xmldoc = minidom.parse('/path/to/index.html')
  File "/usr/lib/python2.7/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/usr/lib/python2.7/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: mismatched tag: line 12, column 4
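
This error is expected for typical HTML: minidom relies on the expat XML parser, which rejects markup that is not well-formed XML, such as an `<img>` tag that is never closed. A minimal reproduction, assuming a made-up inline document in place of index.html (the `SAMPLE_*` strings and the `parse_error` helper are hypothetical, for illustration only):

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical samples: the first is ordinary HTML with an unclosed <img>,
# the second is the same document written as well-formed XML (XHTML style).
SAMPLE_HTML = '<html><body><p><img src="a.png"></p></body></html>'
SAMPLE_XHTML = '<html><body><p><img src="a.png"/></p></body></html>'


def parse_error(markup):
    """Return the expat error message for markup, or None if it parses."""
    try:
        minidom.parseString(markup)
    except ExpatError as exc:
        return str(exc)
    return None


print(parse_error(SAMPLE_HTML))   # an expat "mismatched tag" message
print(parse_error(SAMPLE_XHTML))  # None: the well-formed variant parses

# getElementsByTagName works once the input is well-formed XML.
images = minidom.parseString(SAMPLE_XHTML).getElementsByTagName('img')
print(len(images))
```

So minidom is only an option when the input is known to be well-formed; real-world HTML usually needs a lenient parser.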

lxml, which is recommended by http://www.ianbicking.org/blog/2008/12/lxml-an-underappreciated-web-scraping-library.html, can parse the document, but it does not seem to have a getElementsByTagName method. The following works:

from lxml import html
xmldoc = html.parse('/path/to/index.html')
root = xmldoc.getroot()
for i in root.iter("img"):
    print(i)

but it seems kludgy. Is there a built-in function that I overlooked?

Or is there another, more elegant way to get robust DOM parsing with getElementsByTagName?

score 1 · Accepted Answer · answered Mar 22 '16 at 13:31

If you want a list of elements rather than an iterator, call list on the return value of iter:

from lxml import html
reflist = list(html.parse('/path/to/index.html').iter('img'))
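
If the DOM-style name matters, the same idea can be wrapped in a small function. This is only a sketch: `get_elements_by_tag_name` is a hypothetical helper, not part of lxml, and the inline `SAMPLE_HTML` string stands in for the file on disk.

```python
from lxml import html

# Hypothetical sample with unclosed <img> tags, which lxml's HTML parser accepts.
SAMPLE_HTML = '<html><body><p><img src="a.png"><img src="b.png"></p></body></html>'


def get_elements_by_tag_name(root, tag):
    """Return all elements with the given tag, in document order, as a list.

    Note: iter() also yields root itself if its tag matches, whereas the
    DOM method only returns descendants.
    """
    return list(root.iter(tag))


root = html.fromstring(SAMPLE_HTML)
images = get_elements_by_tag_name(root, 'img')
sources = [img.get('src') for img in images]
print(sources)  # ['a.png', 'b.png']
```

`root.findall('.//img')` is another option: it returns a list directly and matches descendants only.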

answered Mar 22 '16 at 13:31

falsetru

score 0 · Answer 2 · edited May 23 '17 at 10:28

You can use BeautifulSoup for this:

from bs4 import BeautifulSoup

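with open('/path/to/index.html') as f:
    # Name the parser explicitly; bs4 warns when it has to guess one.
    soup = BeautifulSoup(f, 'html.parser')
reflist = soup.find_all('img')  # list-like result of all <img> tags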

See Going through HTML DOM in Python
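
A self-contained variant of the same approach, as a rough sketch: the inline `SAMPLE_HTML` string is made up for illustration and deliberately contains unclosed tags, and `'html.parser'` is the standard-library parser backend (other backends such as lxml could be substituted if installed).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: unclosed <img> and <p> tags that would break minidom.
SAMPLE_HTML = (
    '<html><body><p><img src="a.png"><img src="b.png"></p>'
    '<p>unclosed paragraph</body></html>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
images = soup.find_all('img')                  # all <img> tags in document order
sources = [img.get('src') for img in images]   # attribute access, None if missing
print(sources)  # ['a.png', 'b.png']
```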

edited May 23 '17 at 10:28

Community

answered Mar 22 '16 at 13:24

serv-inc
