How to get sentences from the website html

Question

Hello, I want to extract all sentences from an HTML document. How can I do that? There are many conditions to handle: first we need to strip the tags, then we need to identify sentences, which may end with ., ? or !. Email addresses and website addresses may also contain a . character. How do we write a script like this?

This is a huge task if it needs to deliver good results on arbitrary data. What do you need this for exactly? — Pekka, Mar 03 '11 at 11:09

score 7 · Answer 1 · edited May 23 '17 at 11:55

It's called programming ;). Start by dividing the task into simpler sub-tasks and implement those. For example, in your case, I'd design the program like this:

1. Download and parse the HTML document
2. Extract all text content (pay special attention to <script> and <style> elements)
3. Merge the text content into one long string
4. Solve the problem of finding sentences in a string (likely, just parse until you find a stop character in ".!?" and then start a new sentence)
5. Discard false positives (like empty sentences, number-only sentences, etc.)
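
A minimal Python sketch of the steps above, assuming the HTML has already been downloaded into a string and using only the standard library. The names TextExtractor and extract_sentences are made up for illustration, and the "must contain a letter" filter is just one possible reading of "discard false positives":

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping everything inside <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_sentences(html):
    # Parse the document and extract the text content.
    parser = TextExtractor()
    parser.feed(html)
    parser.close()

    # Merge the text into one long string and normalise whitespace.
    # Joining with a space can split a word that is only partly wrapped
    # in a tag, which this sketch accepts.
    text = " ".join(" ".join(parser.chunks).split())

    # Find sentences: a run of non-stop characters plus any stop
    # characters that follow it.
    candidates = re.findall(r"[^.!?]+[.!?]*", text)

    # Discard false positives: keep only candidates with at least one
    # letter, which drops empty and number-only "sentences".
    sentences = []
    for candidate in candidates:
        sentence = candidate.strip()
        if any(ch.isalpha() for ch in sentence):
            sentences.append(sentence)
    return sentences
```

This naive split still breaks on decimal points, email addresses and URLs, so treat it as a starting point rather than a finished solution.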

answered Mar 03 '11 at 11:13

phihag

What if the long text is not in English? How to get sentences in that case? – Apr 15 '11 at 21:00
@edo888 Most western languages have similar stop characters. If there is no character dividing sentences, your only hope is a linguistic analysis - i.e. parsing the text and applying the rules that define where a sentence ends or starts. There is no general solution for all languages. Feel free to ask a new question about a specific language. The first 3 steps in this answer are language-independent. – phihag Apr 15 '11 at 21:17

score 0 · Answer 2 · answered Mar 03 '11 at 11:29

First you should strip certain tags which are inline formatting elements, like:

I <b>strongly</b> agree.

But you should leave in block-level elements, like DIV and P, because they are even stronger delimiters than ., ? and !.

Then you have to process the content in these block-level elements. Typically there are navigation links with one word, which you might want to filter out later, so it is not the right choice to strip away the block structure of the document.

At this point you can safely use the regex pattern to identify blocks:

>([^<]+)<

When you have your blocks, you can filter out the short ones (navigation elements) and split the big ones (paragraphs of text) using your sentence delimiters.
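
A rough Python sketch of this approach, using the pattern above. The list of inline tags and the min_words threshold are assumptions chosen for illustration, and text_blocks is a made-up name; regexes like these only cope with reasonably tidy markup:

```python
import re

# Assumed (incomplete) list of inline formatting tags to strip.
INLINE_TAGS = re.compile(r"</?(?:b|i|u|em|strong|span|a)\b[^>]*>", re.IGNORECASE)

# The block pattern from the answer: text between a ">" and the next "<".
BLOCK = re.compile(r">([^<]+)<")


def text_blocks(html, min_words=4):
    # Remove inline tags so "I <b>strongly</b> agree." becomes one text run.
    without_inline = INLINE_TAGS.sub("", html)

    # Whatever text remains between block-level tags is one block.
    blocks = BLOCK.findall(without_inline)

    # Normalise whitespace, then drop short blocks such as one-word
    # navigation links. The threshold is arbitrary.
    cleaned = (" ".join(block.split()) for block in blocks)
    return [block for block in cleaned if len(block.split()) >= min_words]
```

Each returned block can then be split into sentences on its own, so a sentence never runs across a paragraph boundary.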

There are interesting questions about when a full stop signals the end of a sentence and when it is just a decimal point, but I leave that to you. :)
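
One possible heuristic for that, sketched in Python: only treat ., ? or ! as a sentence end when it is followed by whitespace and then a capital letter. That leaves decimal points, email addresses and website addresses alone, because the dot inside them is not followed by a space. The rule itself and the name split_sentences are assumptions for illustration, not a complete solution; abbreviations such as "Dr. Smith" would still be split incorrectly.

```python
import re

# A boundary is whitespace that follows a stop character and precedes
# an ASCII capital letter. This is a heuristic, not a grammar.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    parts = SENTENCE_BOUNDARY.split(text)
    return [part.strip() for part in parts if part.strip()]
```

For text in other languages the capital-letter check would need to be adapted or dropped.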
