
Sometimes I am not sure when to use one approach or another. I usually parse all sorts of things with Python, but I would like to focus this question on HTML parsing.

Personally, I find DOM manipulation really useful when I have to parse more than a couple of regular elements (e.g. the title and body of each item in a list of news).

However, I have found myself in situations where it is not clear to me whether I should build a regex or try to get the desired value by simply manipulating strings. A fictional example: I have to get the total number of photos in an album, and the only place that number appears is in a counter like this:

(1 of 190)

So I have to get the '190' out of the whole HTML document. I could write a regex for that, although regex is not exactly the best tool for parsing HTML, or so I have always understood. On the other hand, using the DOM seems like overkill, as it is just a single simple element. String manipulation seems to be the best way, but I am not really sure whether I should proceed like that in a case like this.

Can you tell me how you would parse this kind of single element from an HTML document using Python (or any other language)?

Bob Dem

2 Answers


It's a subjective question (with subjective answers), but in general I'd try to avoid using regex for parsing HTML/XML, as has been discussed previously on SO. I would use a regex only if the input markup is small and unlikely to grow more complex, and the pattern being searched for is unambiguous and easy to describe as a regex. It's a matter of balancing the right tool for the job against the need to be practical.

For your concrete example, I think it'd be OK to start with a regex. But if you find yourself extracting additional information from the input and/or the regex starts to get cumbersome, switch to a parser.
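
As a rough illustration of the "start with a regex" option, here is a minimal sketch. The surrounding markup, the `photo_total` helper and the `COUNTER_RE` name are made up for the example, since the question only shows the "(1 of 190)" text itself:

```python
import re

# Hypothetical markup: the question only gives the "(1 of 190)" text,
# so the tags and class names around it are assumptions.
html = '<div class="album"><span class="counter">(1 of 190)</span></div>'

# Match "(<current> of <total>)" and capture both numbers.
COUNTER_RE = re.compile(r"\((\d+)\s+of\s+(\d+)\)")

def photo_total(markup):
    """Return the total from the first "(N of M)" counter, or None if absent."""
    match = COUNTER_RE.search(markup)
    return int(match.group(2)) if match else None

print(photo_total(html))
```

The pattern describes only the counter text, not any tags, which is why it stays manageable. Once it starts needing to know about the markup around the counter, that is probably the point to switch to a parser.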

Óscar López

People shy away from using regexes to search HTML because a regex isn't the right tool for the job when parsing tags. But everything should be considered on a case-by-case basis. You aren't searching for tags; you are searching for a well-defined string in a document. It seems to me the simplest solution is just a regex or some sort of XPath expression -- simple parsing requires simple tools.
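
If you want to be sure the string is found in the visible text rather than in an attribute or a script, one possible middle ground is to let a parser strip the markup and then run the same simple search on what is left. This sketch uses the standard library's `html.parser` instead of XPath so that it runs without extra packages. The `TextCollector` class, the `photo_total_from_text` helper and the sample page are all made up for the example:

```python
import re
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Attribute values never arrive here, only text between tags.
        if not self._skip_depth:
            self.chunks.append(data)

def photo_total_from_text(markup):
    """Return M from the first "(N of M)" found in the text nodes, or None."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    text = " ".join(parser.chunks)
    match = re.search(r"\(\d+\s+of\s+(\d+)\)", text)
    return int(match.group(1)) if match else None

# Hypothetical page with decoys in a script and in an attribute.
page = """
<html><head><script>var label = "(1 of 999)";</script></head>
<body><p title="(1 of 5)">Photo <b>(1 of 190)</b></p></body></html>
"""
print(photo_total_from_text(page))
```

The regex is still doing the real work here. The parser only narrows down where it is allowed to look, so the tool stays about as simple as the task.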

Colonel Panic