Problem Statement:
I have to validate the ID value of every element in the HTML content. The rule for an ID value is:
An ID value must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").
Code using a regular expression:
>>> content
'<div id="11"> <p id="34"></p><div id="div2"> </div></div>'
>>> import re
>>> all_ids = re.findall(r'id="([^"]*)"', content)
>>> id_validation = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*$")
>>> invalid_ids = [i for i in all_ids if not id_validation.match(i)]
>>> invalid_ids
['11', '34']
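The regular-expression approach above can be wrapped in a small function. This is only a sketch: the names ID_ATTR, ID_PATTERN and find_invalid_ids are made up for illustration, and it assumes every id attribute is written with double quotes, as in the sample content. It uses fullmatch (Python 3.4+) instead of match with a trailing "$", because "$" also matches just before a final newline.

```python
import re

# Same extraction pattern as in the question: assumes double-quoted id attributes.
ID_ATTR = re.compile(r'id="([^"]*)"')
# Rule from the problem statement: a letter, then letters, digits, "-", "_", ":" or ".".
ID_PATTERN = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*")


def find_invalid_ids(html):
    """Return every extracted id value that does not satisfy the rule."""
    return [value for value in ID_ATTR.findall(html)
            if ID_PATTERN.fullmatch(value) is None]


content = '<div id="11"> <p id="34"></p><div id="div2"> </div></div>'
print(find_invalid_ids(content))  # ['11', '34']
```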
Code using the xml.etree.ElementTree parser:
>>> import xml.etree.ElementTree as PARSER
>>> root = PARSER.fromstring(content)
>>> all_ids = [i.attrib["id"] for i in root.iter() if "id" in i.attrib]
>>> all_ids
['11', '34', 'div2']
>>> id_validation = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*$")
>>> [i for i in all_ids if not id_validation.match(i)]
['11', '34']
>>>
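For comparison, here is the parser-based approach as a self-contained function. It is a sketch under one assumption: the content must be well-formed XML, because ElementTree raises ParseError on anything else (for example an unclosed tag). The names ID_PATTERN and invalid_ids_from_tree are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Rule from the problem statement: a letter, then letters, digits, "-", "_", ":" or ".".
ID_PATTERN = re.compile(r"[A-Za-z][-A-Za-z0-9_:.]*")


def invalid_ids_from_tree(markup):
    """Parse well-formed markup and return the id values that break the rule."""
    root = ET.fromstring(markup)  # raises ET.ParseError on malformed input
    return [elem.attrib["id"] for elem in root.iter()  # iter() includes the root element
            if "id" in elem.attrib
            and ID_PATTERN.fullmatch(elem.attrib["id"]) is None]


content = '<div id="11"> <p id="34"></p><div id="div2"> </div></div>'
print(invalid_ids_from_tree(content))  # ['11', '34']
```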
It is also a one-liner with lxml, but the existing code does not use the lxml library for some reason.
>>> from lxml import etree
>>> root = etree.fromstring(content)
>>> root.xpath("//*/@id")
['11', '34', 'div2']
The input content contains 100000 tags, so which of the above approaches is best for performance?
Timing results:
For big content:
Time RE:- 0.00285315513611 Time root:- 0.0108540058136
For small content:
Time RE:- 0.000186920166016 Time root:- 4.00543212891e-05
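The question does not show how these figures were measured. One way to get a repeatable comparison is timeit, sketched below. The make_content generator produces synthetic test data (not the real input), and best_time and compare are hypothetical helper names. The numbers will differ between machines and Python versions.

```python
import re
import timeit
import xml.etree.ElementTree as ET

ID_ATTR = re.compile(r'id="([^"]*)"')


def ids_by_regex(markup):
    """Extract id values with the regular expression from the question."""
    return ID_ATTR.findall(markup)


def ids_by_tree(markup):
    """Extract id values by parsing the markup with ElementTree."""
    return [e.attrib["id"] for e in ET.fromstring(markup).iter() if "id" in e.attrib]


def make_content(n_tags):
    """Build synthetic, well-formed markup with n_tags child elements."""
    body = "".join('<p id="p%d"></p>' % i for i in range(n_tags))
    return '<div id="root">' + body + "</div>"


def best_time(func, markup, repeat=3, number=3):
    """Best per-call time in seconds over `repeat` runs of `number` calls each."""
    totals = timeit.repeat(lambda: func(markup), repeat=repeat, number=number)
    return min(totals) / number


def compare(n_tags):
    """Return (regex_seconds, tree_seconds) for content with n_tags child tags."""
    markup = make_content(n_tags)
    # Both approaches must find the same ids before their timings are comparable.
    assert ids_by_regex(markup) == ids_by_tree(markup)
    return best_time(ids_by_regex, markup), best_time(ids_by_tree, markup)


if __name__ == "__main__":
    for n in (10, 100000):
        regex_s, tree_s = compare(n)
        print("%d tags: regex %.6f s, ElementTree %.6f s" % (n, regex_s, tree_s))
```

Taking the minimum over several repeats reduces noise from other processes, which matters most for the small input where a single call takes only microseconds.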