3

I have a fixed, well-structured HTML source. The incoming data is clean and small; it just contains a short list of divs. I know that one should normally use an HTML parser for parsing HTML, but this looks like a special case and I am not sure which approach I should use. The problem conditions are below:

  • Data is clear and well structured
  • Data is small
  • Performance matters; the application must be able to get as much data as possible
  • Application will write data to MongoDB database
  • Implementation programming language will be Scala or Python

Any opinion is valuable, so what should I do?

alecxe
Hüseyin Zengin

1 Answer

7

I would still stick to using an HTML parser because, at the very least, there is a specific data format and a specialized tool that understands that format.

If performance matters here, there is the blazingly fast `lxml` package. For HTML, use `lxml.html`.
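As a minimal sketch of what that could look like: the question does not show the actual markup, so the sample document and the `item` class below are assumptions made purely for illustration.

```python
import lxml.html

# Hypothetical sample; the real markup is not shown in the question.
SAMPLE_HTML = """
<html><body>
  <div class="item">first</div>
  <div class="item">second</div>
  <div class="item">third</div>
</body></html>
"""


def extract_items(html):
    """Return the stripped text of every <div class="item"> in the document."""
    tree = lxml.html.fromstring(html)
    # XPath selects the divs directly; text_content() flattens any nested markup.
    return [div.text_content().strip() for div in tree.xpath('//div[@class="item"]')]


if __name__ == "__main__":
    print(extract_items(SAMPLE_HTML))
```

The resulting list of strings could then be turned into documents for MongoDB; that step is left out here because it depends on the schema.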

You can also use the awesome BeautifulSoup package and let it use the `lxml` parser under the hood. Besides, if the data you need is in a specific part of the HTML document, you can gain performance by asking BeautifulSoup to parse only the relevant part of the document; see more at: Parsing only part of a document.
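A rough sketch of partial parsing with BeautifulSoup's `SoupStrainer`, again assuming a hypothetical `item` class on the divs of interest (the real markup is not shown in the question):

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample; everything outside the "item" divs is noise.
SAMPLE_HTML = """
<html><head><title>ignored</title></head><body>
  <p>ignored paragraph</p>
  <div class="item">first</div>
  <div class="item">second</div>
</body></html>
"""


def extract_items(html):
    """Parse only <div class="item"> elements and return their text."""
    # The strainer tells BeautifulSoup to build a tree from matching tags only.
    only_items = SoupStrainer("div", attrs={"class": "item"})
    soup = BeautifulSoup(html, "lxml", parse_only=only_items)
    return [div.get_text(strip=True) for div in soup.find_all("div")]


if __name__ == "__main__":
    print(extract_items(SAMPLE_HTML))
```

Note that `parse_only` is honored by the `lxml` and `html.parser` tree builders but not by `html5lib`.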

And, to follow the tradition for HTML+regex threads, here is the reference to the famous topic covering the reasons why you should not use regex for parsing HTML:

Community
alecxe
  • I know that I shouldn't use regex for HTML parsing; I know what regex is and what it turns into when implemented, and yes, I took an automata course too. Most of the reasons are about HTML's unstable structure and large amounts of data, which does not apply to our case: we have well-structured and small data to process. So I appreciate your answer, but this is not what we are looking for, I think. – Hüseyin Zengin Oct 11 '14 at 20:59
  • @HüseyinZengin thanks. It's difficult to say without seeing what kind of data you have, how much of it there is, and what you need to extract from it. I guess your best bet would be to measure the performance yourself. For example, implement it using both `lxml` and a regex-only approach and benchmark them. – alecxe Oct 11 '14 at 21:02
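One way such a benchmark might be set up is sketched below with `timeit`. The sample markup, the `item` class, and the regular expression are all assumptions for illustration; the regex in particular only works while the markup keeps exactly this fixed shape.

```python
import re
import timeit

import lxml.html

# Hypothetical sample: a small, fixed list of divs.
SAMPLE_HTML = "<html><body>" + "".join(
    '<div class="item">value {}</div>'.format(i) for i in range(20)
) + "</body></html>"

# Matches only this exact tag shape; any change in attributes breaks it.
ITEM_RE = re.compile(r'<div class="item">(.*?)</div>')


def extract_with_lxml(html):
    tree = lxml.html.fromstring(html)
    return [div.text_content() for div in tree.xpath('//div[@class="item"]')]


def extract_with_regex(html):
    return ITEM_RE.findall(html)


def benchmark(func, html, number=1000):
    """Return the total seconds taken by `number` calls of func(html)."""
    return timeit.timeit(lambda: func(html), number=number)


if __name__ == "__main__":
    # Both approaches must agree before their timings are worth comparing.
    assert extract_with_lxml(SAMPLE_HTML) == extract_with_regex(SAMPLE_HTML)
    for func in (extract_with_lxml, extract_with_regex):
        print("{}: {:.4f} s".format(func.__name__, benchmark(func, SAMPLE_HTML)))
```

Timings depend heavily on the real documents, so the sample should be replaced with representative input before drawing conclusions.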