Problem Statement:

I have to validate the ID value of every element in the HTML content. The rule for an ID value is:

ID value must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
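The rule above can be written as a small predicate. This is only a minimal sketch: the names `ID_PATTERN` and `is_valid_id` are hypothetical, and the pattern ends with `\Z` rather than `$` on the assumption that a value with a trailing newline should be rejected (`$` would let it through).

```python
import re

# Hypothetical names; the pattern encodes the rule stated above.
# \Z anchors at the very end of the string, so "div2\n" is rejected.
ID_PATTERN = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*\Z")

def is_valid_id(value):
    """Return True if value starts with a letter and continues with
    letters, digits, hyphens, underscores, colons, or periods."""
    return ID_PATTERN.match(value) is not None

print(is_valid_id("div2"))       # True
print(is_valid_id("11"))         # False
print(is_valid_id("a-b_c:d.e"))  # True
print(is_valid_id("div2\n"))     # False
```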

Code using a regular expression:

>>> import re
>>> content = '<div id="11"> <p id="34"></p><div id="div2"> </div></div>'
>>> # Collect every double-quoted id attribute value
>>> all_ids = re.findall(r'id="([^"]*)"', content)
>>> # A letter first, then letters, digits, "-", "_", ":" or "."
>>> id_validation = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*$")
>>> invalid_ids = [i for i in all_ids if not id_validation.match(i)]
>>> invalid_ids
['11', '34']

Code using the xml.etree.ElementTree parser:

>>> import xml.etree.ElementTree as PARSER
>>> root = PARSER.fromstring(content)
>>> # iter() replaces getiterator(), which is deprecated and was removed in Python 3.9
>>> all_ids = [i.attrib["id"] for i in root.iter() if "id" in i.attrib]
>>> all_ids
['11', '34', 'div2']
>>> id_validation = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*$")
>>> [i for i in all_ids if not id_validation.match(i)]
['11', '34']
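If a parsed tree is already available, the ID check can walk it once instead of parsing the content again. The following is a minimal sketch under that assumption; `find_invalid_ids` is a hypothetical helper name, and `fromstring` only accepts well-formed markup, so real-world HTML may fail to parse.

```python
import re
import xml.etree.ElementTree as ET

ID_VALIDATION = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*$")

def find_invalid_ids(root):
    """Return id attribute values, in document order, that break the ID rule."""
    invalid = []
    for element in root.iter():  # includes the root element itself
        value = element.get("id")
        if value is not None and not ID_VALIDATION.match(value):
            invalid.append(value)
    return invalid

content = '<div id="11"> <p id="34"></p><div id="div2"> </div></div>'
root = ET.fromstring(content)
print(find_invalid_ids(root))  # ['11', '34']
```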

This is also a one-liner with lxml, but the existing code cannot use the lxml library for unrelated reasons.

>>> from lxml import etree
>>> root = etree.fromstring(content)
>>> root.xpath("//*/@id")
['11', '34', 'div2']

The input content contains 100000 tags, so which of the above approaches performs best?

Time results:

For big content:

Time RE: 0.00285315513611
Time root: 0.0108540058136

For small content:

Time RE: 0.000186920166016
Time root: 4.00543212891e-05
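A comparison like the one above could be reproduced with `timeit`. This is a rough sketch, not the code that produced those numbers: the function names and the synthetic input built by `make_content` are assumptions, since the original test data is not shown, and the timings will vary by machine and Python version.

```python
import re
import timeit
import xml.etree.ElementTree as ET

ID_ATTR = re.compile(r'id="([^"]*)"')

def ids_by_regex(content):
    """Extract id values by scanning the raw text."""
    return ID_ATTR.findall(content)

def ids_by_etree(content):
    """Extract id values by parsing and walking the tree."""
    root = ET.fromstring(content)
    return [el.attrib["id"] for el in root.iter() if "id" in el.attrib]

def make_content(n_tags):
    # Synthetic input (an assumption): n_tags <p> elements inside one <div>.
    inner = "".join('<p id="p%d"></p>' % i for i in range(n_tags))
    return '<div id="root">%s</div>' % inner

for n_tags in (500, 100000):
    content = make_content(n_tags)
    # Both approaches should agree before their speed is compared.
    assert ids_by_regex(content) == ids_by_etree(content)
    for func in (ids_by_regex, ids_by_etree):
        # Best of 3 repeats, each repeat calling the function 3 times.
        best = min(timeit.repeat(lambda: func(content), number=3, repeat=3))
        print("%-13s %6d tags: %.6f s for 3 calls" % (func.__name__, n_tags, best))
```

Note that `ids_by_etree` includes the parsing cost. If the tree is already parsed for other purposes, only the traversal should be timed.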

Vivek Sable
  • If not sure, measure it ([`timeit`](https://docs.python.org/2/library/timeit.html)). – alecxe Aug 04 '15 at 15:02
  • @alecxe: Yes. If the content is small (around 500 tags), parsing with the XML library is faster, but when the content is big, the regex is faster. – Vivek Sable Aug 04 '15 at 15:35
  • Usually, using regular expressions to parse structured data (with actual language rules) is bad practice. If you really trust your data and can guarantee that the regex approach would be bullet-proof, stable, and reliable in your case, then go ahead and use regexes. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags for more explanations and examples. – alecxe Aug 04 '15 at 15:37
  • I would though evaluate `lxml` first. – alecxe Aug 04 '15 at 15:38
  • `lxml` is good, but for some reasons I am not able to use it (that is a different issue). I also do not like the regex approach, because if the content changes the regex will fail badly. – Vivek Sable Aug 04 '15 at 15:41
  • @alecxe: What should I do, regex or XML parser? I already have an XML object that is used for another validation process. – Vivek Sable Aug 04 '15 at 15:42

0 Answers