Using python3, what is the fastest way to extract all `<div>` blocks from an HTML string?

There are `<div>` blocks nested inside other `<div>` blocks. What is the fastest way to extract all `<div>` blocks from an HTML string (bs4, lxml, or regex)?
- https://www.crummy.com/software/BeautifulSoup/ is great for parsing HTML. – ospahiu, Aug 25 '16 at 21:43
- Is it important that you write the code natively yourself? There is a Python package called scrapy (http://scrapy.org/) that you can install, which has methods you can call for parsing HTML. – NiallJG, Aug 25 '16 at 21:43
- Give this a try: find `.*`, replace with ''. Guaranteed no more divs anywhere. – Aug 25 '16 at 22:18
2 Answers
`lxml` is generally considered the fastest of the existing Python parsers, though parsing speed depends on multiple factors, starting with the specific HTML being parsed and ending with the computational power you have available. For HTML parsing, use the `lxml.html` subpackage:
from lxml.html import fromstring, tostring
data = """<html><body><div>outer <div>inner</div></div></body></html>"""  # sample HTML; use your own string here
root = fromstring(data)
# HTML source of every div, nested ones included:
print([tostring(div) for div in root.xpath(".//div")])
# plain text content of every div:
print([div.text_content() for div in root.xpath(".//div")])
There is also the awesome `BeautifulSoup` parser which, if allowed to use `lxml` under the hood, would be a great combination of convenience, flexibility, and speed. It would not generally be faster than pure `lxml`, but it comes with one of the best APIs I've ever seen, letting you "view" your XML/HTML from different angles and use a huge variety of techniques:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, "lxml")  # use lxml as the underlying parser
# HTML source of every div:
print([str(div) for div in soup.find_all("div")])
# plain text content of every div:
print([div.get_text() for div in soup.find_all("div")])
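If you later need several kinds of tags at once, find_all also accepts a list of tag names; a one-line sketch, reusing the soup from above:
# find_all matches any tag name in the list:
print([el.name for el in soup.find_all(["div", "p"])])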
And, I personally think, there is rarely a case where regex is suitable for HTML parsing.
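For instance, a minimal sketch (the sample HTML is illustrative, not from the question) of how a naive regex mishandles the nested divs the question mentions:
import re
data = "<div>outer <div>inner</div> tail</div>"
# the non-greedy match stops at the FIRST closing tag, truncating the outer div:
print(re.findall(r"<div>.*?</div>", data))
# -> ['<div>outer <div>inner</div>']; a real parser tracks nesting for you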
- root.xpath(".//div") will show a list of objects; how do I get each block as a string? – V Y, Aug 25 '16 at 21:58
- @VY are you interested in the text contents of the tags or the HTML representations? Thanks. – alecxe, Aug 25 '16 at 21:59
- What if I want to extract `<div>` and `<p>` blocks? Is there any way to use lxml XPath to extract multiple kinds of tags? – V Y, Aug 25 '16 at 22:24
- @VY sure, with lxml comes the power of XPath expressions, see http://stackoverflow.com/questions/721928/xpath-to-select-multiple-tags. – alecxe, Aug 25 '16 at 22:42
- I tried root.xpath(".//div or .//p") and it didn't work; do you mind writing down the code? – V Y, Aug 25 '16 at 22:50
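A minimal sketch answering that last comment (the sample HTML is illustrative): ".//div or .//p" is a boolean expression, so xpath() returns True/False; XPath's union operator "|" combines the two node sets instead:
from lxml.html import fromstring, tostring
data = "<div><p>a paragraph</p><div>a div</div><span>ignored</span></div>"
root = fromstring(data)
# union of two node sets, returned in document order:
print([tostring(el) for el in root.xpath(".//div | .//p")])
# equivalent form using a self-axis test:
print([el.tag for el in root.xpath(".//*[self::div or self::p]")])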
When I teach XML/HTML parsing with Python, I usually present these levels of complexity:
- RegEx: efficient for (very) simple parsing, but it can become hard to maintain.
- SAX: efficient and safe for parsing XML as a stream. Easy for extracting pieces of data, but awful when you want to transform the tree, and it can become really difficult to maintain. Who still uses it anyway? (A minimal stream-parsing sketch follows at the end of this answer.)
- DOM parsing or ElementTree parsing with lxml: less efficient, since the whole XML tree is loaded in memory (which can be an issue for big XML). But the library is compiled (in Cython), very popular and reliable, and easy to understand: the code can be maintained.
- XSLT1 is also a possibility. Very good for transforming the tree in depth, but not efficient because of the template machinery. You need to learn a new language that turns out to be difficult, and maintenance often becomes heavy. Note that lxml can do XSLT with Python functions as an extension.
- XSLT2 is very powerful, but the only implementation I know of is Saxon, in Java, and launching the JRE is time-consuming. The language is difficult to learn, and one needs to be an expert to understand every subtlety. Worse than XSLT1.
For your problem, lxml (or BeautifulSoup) sounds good.
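Here is that minimal sketch of the stream-parsing level, using the standard library's event-driven html.parser (the sample HTML is illustrative):
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collects the text of every div, nested ones included."""
    def __init__(self):
        super().__init__()
        self.buffers = []   # one text buffer per currently open <div>
        self.results = []   # finished div texts

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.buffers.append([])

    def handle_data(self, data):
        for buf in self.buffers:
            buf.append(data)    # text belongs to every open div

    def handle_endtag(self, tag):
        if tag == "div" and self.buffers:
            self.results.append("".join(self.buffers.pop()))

parser = DivExtractor()
parser.feed("<div>outer <div>inner</div> tail</div>")
print(parser.results)   # ['inner', 'outer inner tail']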
- How would you assess this tag-parsing regex? `<(?:(?:(?:(script|style|head|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>` – Aug 25 '16 at 22:23
- Split up and documented, it can be maintained. The question is: can your colleagues fix a bug in this? Efficiency can be very good even with a long RegEx. – Laurent LAPORTE, Aug 25 '16 at 22:30
- The solution to maintenance is [RegexFormat 7](http://www.regexformat.com). And there are no bugs in this. I use the core tag subexpressions in many variations, generalized for out-of-order attribute matching, search and replace, many variations. I've even used this to create a SAX parser and used it for many years, 100% regex. – Aug 25 '16 at 22:42