Extracting data from a web page

Question

I am doing a school project that requires extracting data from web pages. To be precise, I need a library or open-source program that extracts the human-readable content from HTML, something like the text content a web browser renders.

I know that parsing HTML with regexes is the worst way to extract text from it.

Extra info:

I need it for computing similarity between text documents.

Any help would be appreciated. Thanks

score 1 · Answer 1 · edited May 23 '17 at 12:03


I would highly recommend reading this question's first answer, to keep you away from parsing HTML with regular expressions. That answer illustrates why you shouldn't far better than I could, so I defer to it.

You should also look into XML parsers instead of trying to "parse by hand" with a regex, as the referenced question and its answers explain.


answered Apr 19 '11 at 02:44

Ryan Wersal


I'll be running this on thousands of documents. My concern is that if I parse the data with a regex, JavaScript functions might show up in the output. I'll also miss dynamic or JavaScript-rendered content. Thanks for answering :) – Aditya Apr 19 '11 at 02:53

ninjagecko · Answer 2 · 2011-04-19T02:43:40.017

0

If all you care about is textual similarity, you could just write a regex that strips out all HTML tags of the form </?(every|single|valid|tag)[^>]*> (perhaps after first removing every <script>.*</script> element), then mash all the content together into one very long paragraph. That wouldn't be a bad use of a regex at all; that's what they're there for.
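A minimal Python sketch of that regex approach, under a few assumptions: `strip_tags` is a hypothetical helper name, the tag pattern matches any tag-like token instead of an explicit list of valid tag names, and `<style>` elements and HTML comments are dropped along with `<script>` elements. It is only meant for rough similarity work, not for general HTML parsing.

```python
import html
import re

# Assumption: drop <script> and <style> elements together with their contents.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
# Assumption: HTML comments are noise for similarity purposes.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Any opening or closing tag, instead of an explicit list of valid tag names.
TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")
WHITESPACE_RE = re.compile(r"\s+")


def strip_tags(markup: str) -> str:
    """Return the markup as one long line of text with tags removed."""
    text = SCRIPT_STYLE_RE.sub(" ", markup)
    text = COMMENT_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    # Turn entities such as &amp; back into plain characters.
    text = html.unescape(text)
    # Collapse all whitespace runs so the result is a single paragraph.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "<script>alert('x')</script><p>Hello <b>world</b></p><p>Second &amp; last</p>"
    print(strip_tags(sample))
```

Because every tag is replaced by a space, punctuation that directly follows a closing tag ends up separated from the preceding word, which usually does not matter for a bag-of-words similarity measure.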

I might recommend http://docs.python.org/library/xml.dom.minidom.html, but IMHO its interface can be very awkward. Also, you don't need access to the hierarchical structure, just the text. If you did need the structure, a parser would be better than a regex, which would be a terrible idea in that case.
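For comparison, here is a rough sketch of the parser route. It swaps in Python 3's standard-library `html.parser` rather than `xml.dom.minidom`, since minidom expects well-formed XML and many real pages are not. The names `TextExtractor` and `extract_text` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}  # assumption: these never hold readable text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


if __name__ == "__main__":
    page = "<body><script>var x = 1;</script><p>Hello <b>world</b></p></body>"
    print(extract_text(page))
```

Joining chunks with a space means a word split by inline markup (for example `w<b>or</b>ld`) comes out as separate tokens. Like the regex version, this only sees the static markup, so JavaScript-rendered content is still missing.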

edited Apr 19 '11 at 02:43

answered Apr 19 '11 at 02:37

ninjagecko


I'll be running this on thousands of documents. My concern is that if I parse the data with a regex, JavaScript functions might show up in the output. I'll also miss dynamic or JavaScript-rendered content. Thanks for answering :) – Aditya Apr 19 '11 at 02:54
I believe the example algorithm I gave you will probably not let JavaScript functions through, as long as you aren't parsing the entire World Wide Web. Also, you will miss JavaScript-rendered content no matter what program you use, unless you do it through a web browser. – ninjagecko Apr 19 '11 at 03:07
