1

What's faster? I just made a web scraper that uses HTML Agility Pack, and it's consuming massive amounts of memory.

Profiling it with a memory profiler, I found that the HtmlDocument, HtmlNode, etc., instances are taking up most of the memory.

I feel like it might be faster and more memory-efficient to use regex. Am I wrong?

blizz
    As a rule of thumb, the less you import, the faster the program; the more you import, the faster the programmer. Certainly, regexes are cheaper (unless they were using regexes behind the scenes). – jpaugh May 31 '12 at 04:30
  • See that famous question here on SO: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags It should discourage you from using Regex to parse HTML. – Simon Mourier Jun 15 '12 at 09:26

2 Answers

1

Depending on what exactly you are doing, regex really could speed things up and free some memory. The question is how rigid and well-formed the pages you are extracting data from are. Regex is much more easily confused than a parser by perfectly valid, but unexpected, HTML constructs that you might encounter in the wild.
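To illustrate the point, here is a minimal C# sketch (the pattern and HTML snippets are invented for illustration): a naive regex for extracting links silently misses markup variants that are perfectly valid HTML.

```csharp
using System;
using System.Text.RegularExpressions;

class RegexPitfall
{
    static void Main()
    {
        // All three snippets are valid HTML, but only the first
        // matches the naive pattern below.
        string[] snippets =
        {
            "<a href=\"/page\">link</a>",
            "<a href='/page'>link</a>",                 // single-quoted attribute
            "<a class=\"x\" href = \"/page\">link</a>"  // attribute order and spacing differ
        };

        var naive = new Regex("<a href=\"([^\"]*)\">");
        foreach (var html in snippets)
            Console.WriteLine(naive.IsMatch(html)); // True, then False, False
    }
}
```

Every variant a regex must tolerate (quoting style, attribute order, whitespace, nesting) makes the pattern longer and slower, which eats into the speed advantage.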

Eugene Ryabtsev
  • And, possibly, *less* confused by perfectly invalid code that is found in the wild. – jpaugh May 31 '12 at 04:31
  • @Eugene My thoughts exactly. In fact, just today I made the switch from regex TO Html Agility Pack for this very reason. – kaveman May 31 '12 at 04:31
1

A regex will be a lot faster than HTML Agility Pack.

But you should remember that HTML need not always be well formed, so extracting the data you want using only regex may fail. Browsers are very forgiving about mistakes.

The Agility Pack is a great tool: it provides a lot of features for the memory it is consuming.
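For example, the Agility Pack repairs broken markup into a queryable tree, much like a browser does. A minimal C# sketch (the malformed HTML is invented; requires the HtmlAgilityPack NuGet package):

```csharp
using System;
using HtmlAgilityPack; // NuGet: HtmlAgilityPack

class ForgivingParse
{
    static void Main()
    {
        // Malformed but browser-renderable HTML: unclosed <b>, missing </li> tags.
        // A regex anchored on closing tags would miss or mangle these items.
        string html = "<ul><li><b>first<li>second</ul>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // the pack fixes up the tree on load

        foreach (var li in doc.DocumentNode.SelectNodes("//li"))
            Console.WriteLine(li.InnerText.Trim());
    }
}
```

On the memory side, note that each HtmlDocument holds the whole parsed tree; scraping one page at a time and letting each document go out of scope (rather than keeping many alive at once) usually keeps the footprint manageable.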

nunespascal