Html Parsing vs. Regex

Question

I have a fixed, well-structured HTML source. The incoming data is clean and small; it just contains a short list of divs. I know that an HTML parser is the usual recommendation for parsing HTML, but this looks like a special case and I am not sure which approach I should use. The conditions of the problem are below:

Data is clean and well structured
Data is small
Performance matters; the application must be able to retrieve as much data as possible
The application will write the data to a MongoDB database
The implementation language will be Scala or Python

Any opinion is valuable, so what should I do?

score 7 · Accepted Answer · edited May 23 '17 at 12:31


I would still stick to using an HTML Parser, because, at least, there is a specific data format and a specialized tool that understands the format.

If performance matters here, there is the blazingly fast lxml package. For HTML, use lxml.html.
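As a rough illustration of the lxml.html route, here is a minimal sketch. The sample markup, the `item` class name, and the `extract_items` helper are all made up for the example, since the question does not show the real data:

```python
import lxml.html

# Hypothetical input: the question only says the source is "a little list of divs".
SAMPLE_HTML = """
<html><body>
  <div class="item">first</div>
  <div class="item">second</div>
  <div class="item">third</div>
</body></html>
"""

def extract_items(html):
    """Return the stripped text of every <div class="item"> in the document."""
    root = lxml.html.fromstring(html)
    # The XPath expression selects divs whose class attribute is exactly "item".
    return [div.text_content().strip() for div in root.xpath('//div[@class="item"]')]

items = extract_items(SAMPLE_HTML)
print(items)
```

The XPath expression would need to be adapted to whatever attributes the real divs carry.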

You can also use the awesome BeautifulSoup package and let it use the lxml parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain some performance by asking BeautifulSoup to parse only the relevant part of the document; see "Parsing only part of a document" in the BeautifulSoup documentation.
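A small sketch of partial parsing with BeautifulSoup's `SoupStrainer` and the `parse_only` argument could look like the following. The sample markup and the `item` class are assumptions for illustration, and the example assumes both `bs4` and `lxml` are installed:

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical input with some surrounding markup that we do not care about.
SAMPLE_HTML = """
<html><body>
  <div id="header">navigation we do not need</div>
  <div class="item">first</div>
  <div class="item">second</div>
  <p>footer text we do not need</p>
</body></html>
"""

# Only build tree nodes for <div class="item">; other elements are skipped.
only_items = SoupStrainer("div", attrs={"class": "item"})

# "lxml" selects the lxml-backed parser instead of Python's built-in one.
soup = BeautifulSoup(SAMPLE_HTML, "lxml", parse_only=only_items)

item_texts = [div.get_text(strip=True) for div in soup.find_all("div")]
print(item_texts)
```

Whether this is actually faster than parsing the whole document depends on the input, so it is worth measuring on real data.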

And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML:

RegEx match open tags except XHTML self-contained tags


answered Oct 11 '14 at 20:15

alecxe


I know that I shouldn't use regex for HTML parsing. I know what regex is and what it turns into when implemented, and yes, I took an automata course too. Most of the reasons are about HTML's unstable structure and large amounts of data, which do not apply to our case: we have well-structured, small data to process. So I appreciate your answer, but I don't think this is what we are looking for. – Hüseyin Zengin Oct 11 '14 at 20:59
@HüseyinZengin thanks. It's difficult to say without seeing what kind of data you have, how much of it there is, and what you need to extract from it. I guess your best bet would be to measure the performance yourself. For example, implement it with both `lxml` and a regex-only approach and benchmark them. – alecxe Oct 11 '14 at 21:02
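One possible shape for such a benchmark, using the standard `timeit` module, is sketched below. The generated sample document, the regular expression, and the function names are assumptions for illustration; real measurements should use the actual incoming HTML:

```python
import re
import timeit

import lxml.html

# Hypothetical small, well-structured document: 50 divs in a fixed layout.
SAMPLE_HTML = "<html><body>" + "".join(
    f'<div class="item">value {i}</div>' for i in range(50)
) + "</body></html>"

# This pattern only works because the markup is assumed to be fixed and flat.
ITEM_RE = re.compile(r'<div class="item">(.*?)</div>', re.S)

def with_regex(html):
    """Extract item texts with a precompiled regular expression."""
    return ITEM_RE.findall(html)

def with_lxml(html):
    """Extract item texts by parsing the document with lxml.html."""
    root = lxml.html.fromstring(html)
    return [div.text_content() for div in root.xpath('//div[@class="item"]')]

# Both approaches must agree before comparing their timings makes sense.
assert with_regex(SAMPLE_HTML) == with_lxml(SAMPLE_HTML)

RUNS = 1000
for name, func in (("regex", with_regex), ("lxml", with_lxml)):
    seconds = timeit.timeit(lambda: func(SAMPLE_HTML), number=RUNS)
    print(f"{name}: {seconds / RUNS * 1e6:.1f} microseconds per document")
```

The numbers depend heavily on the machine and the input, so no particular result is implied here.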
