I am writing a search engine that goes to all of my company's affiliate websites, parses the HTML, and stores the content in a database. These websites are really old and not HTML-compliant; out of 100,000 websites, around 25% have bad HTML that makes them difficult to parse. I need to write C# code that fixes the bad HTML and then parses the contents, or come up with another solution that addresses this issue. If you are sitting on an idea, an actual hint or code snippet would help.
- Possible duplicate of http://stackoverflow.com/questions/4587727/screen-scraping-html-with-c-sharp – Ani May 23 '12 at 13:30
- I know WordPress has auto-correcting HTML code; you can view its source code to see how they do it and try the same logic. – eric.itzhak May 23 '12 at 13:30
- What do you mean by bad HTML? If tags aren't closed and things like that, I think parsing is going to be a nightmare. – Sachin Kainth May 23 '12 at 13:30
3 Answers
Just use the Html Agility Pack. It is very good at parsing faulty HTML.
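As a minimal sketch of the pattern (the URL and XPath queries are placeholders, not anything from your sites): the Html Agility Pack parses the broken page into a best-effort DOM, records the problems it recovered from in ParseErrors instead of throwing, and lets you query the result with XPath.

    using System;
    using System.Linq;
    using HtmlAgilityPack; // install via NuGet: HtmlAgilityPack

    class Scraper
    {
        static void Main()
        {
            // Hypothetical affiliate URL; substitute your own list.
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("http://affiliate.example.com/");

            // The parser records what it had to fix instead of failing.
            foreach (HtmlParseError error in doc.ParseErrors)
                Console.WriteLine($"{error.Line},{error.LinePosition}: {error.Reason}");

            // Query the repaired DOM with XPath, e.g. the title and all links.
            string title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText ?? "";
            Console.WriteLine($"Title: {title}");

            var links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
                foreach (HtmlNode link in links)
                    Console.WriteLine($"{HtmlEntity.DeEntitize(link.InnerText).Trim()} -> {link.GetAttributeValue("href", "")}");
        }
    }

From there you would store the extracted text and links in your database instead of writing them to the console.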

esskar
- -1 to HAP: it 'parses' it, but gets a completely wrong DOM model for a number of pages I've tried it on. – Ash Berlin-Taylor Aug 08 '12 at 20:54
People generally use some form of heuristic-driven tag soup parser.
These are mostly just lexers that try their best to build an AST from all the random symbols.

Don Stewart
Use a tag-soup parser; I'm sure there is one for C#. Then you can serialize the DOM to more-or-less valid HTML, depending on whether that parser conforms to the HTML DTD. Alternatively you can use HTML Tidy, which will clean up at least the worst faults.
Regexes are not applicable to this task.
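As a rough illustration of the "parse, repair, re-serialize" idea, here is a sketch using the Html Agility Pack from the other answer as the tag-soup parser: load the broken markup, let it repair the tree, and write the corrected DOM back out as (near-)valid HTML before storing it. If you would rather run HTML Tidy itself, there are managed wrappers for it, but their APIs vary, so treat this only as an example of the approach.

    using System;
    using System.IO;
    using HtmlAgilityPack; // used here as the tag-soup parser; HTML Tidy via a wrapper is an alternative

    class HtmlCleaner
    {
        // Takes raw (possibly broken) HTML and returns a re-serialized, repaired version.
        static string Clean(string rawHtml)
        {
            var doc = new HtmlDocument();
            doc.OptionFixNestedTags = true;   // repair wrongly nested tags
            doc.OptionAutoCloseOnEnd = true;  // close tags left open at the end of the document
            doc.LoadHtml(rawHtml);

            using (var writer = new StringWriter())
            {
                doc.Save(writer);             // write the corrected DOM back out
                return writer.ToString();
            }
        }

        static void Main()
        {
            // Deliberately broken sample: unclosed <li> tags and mis-nested <b>/<i>.
            string broken = "<ul><li>first<li><b>second <i>item</b></i></ul>";
            Console.WriteLine(Clean(broken));
        }
    }

The cleaned output is what you would then feed to your normal extraction and storage code.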