There are many scripts that extract articles from HTML pages. If I use regular expressions to get only the main article from an HTML or PHP page source, what are the best expressions for the job? Also, what is the simplest and best way to do this without regular expressions, in PHP or another language? Some scripts use many filters to extract the main article from the HTML or PHP source, but they have problems with non-English languages, character sets, and multi-byte characters. As a result, they cannot extract a good portion of the main article from the source.
Normally, the main article sits in 'div', 'p', or similar tags in the HTML or PHP source. The rest of the page is made up of other HTML elements: navigation, links, excerpts, and so on. Regular expressions can solve the problems above fairly easily, because multi-byte characters, character sets, and language differences can be defined in the expressions themselves. Most article extraction software uses filters that look for words such as 'comment', 'first', 'next', 'nav', 'button', and 'submit' to decide whether the portion being examined is content or some other element. Those tag names, ids, and classes are most likely only meaningful in English and in ISO Western European character sets. The software cannot extract the exact portion of the article because it does not understand the languages or characters it is trying to filter.
The article extraction library boilerpipe uses the checks below to separate the article from the other elements (if you examine its 'src' files closely):
- Check whether the text is long enough (character and word count).
- Check whether the tag carries a name from a list of telltale words: comment, first, next, nav, and others (array searches, with or without regular expressions).
- Other checks, heuristic and otherwise, to tell the article apart from other HTML elements.
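The first two checks in the list above can be sketched in a few lines. This is a minimal illustration in Python, not boilerpipe's actual code: the function name `looks_like_article`, the thresholds, and the hint-word list are all assumptions made up for this example. Length is measured in Unicode characters rather than bytes, so multi-byte text is not penalized.

```python
import re

# Hypothetical thresholds and hint words, chosen only for illustration.
MIN_CHARS = 80
MIN_WORDS = 15
BOILERPLATE_HINTS = re.compile(
    r"comment|first|next|nav|button|submit", re.IGNORECASE
)

def looks_like_article(text, id_and_class=""):
    """Return True if a block passes the length check and the name check."""
    text = text.strip()
    # Check 1: length. len() counts Unicode code points, not bytes, and
    # \w+ matches letters of any script when the pattern is a str.
    # Either measure may pass, because scripts written without spaces
    # (Japanese, Chinese) yield very few "words".
    word_count = len(re.findall(r"\w+", text))
    if len(text) < MIN_CHARS and word_count < MIN_WORDS:
        return False
    # Check 2: reject blocks whose id/class names contain a hint word.
    if BOILERPLATE_HINTS.search(id_and_class):
        return False
    return True
```

Note that the name check still depends on English id and class names, which is exactly the limitation described above. The length check does not.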
There are theories of article extraction from web pages, but none of them is simpler than using regular expressions. They could be converted into simple regular expressions or other simple programs.
boilerpipe is written in Java to extract articles, but it is too complex and it has the language and character problems described above. Preferably, a combination of several regular expressions and some ordinary code to filter the article would be better.
The exact things that I'm looking for are below:
Regular expressions to extract only the articles from HTML and PHP pages: a few regular expressions that extract only the article from the HTML or PHP source, without any other elements, plus further expressions to check whether a portion might not be article content.
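To make the regular expression request concrete, here is a rough sketch of what such a set of expressions might look like, written in Python as a stand-in for PHP. The function name `extract_paragraphs` and the `min_chars` threshold are assumptions for illustration. It only handles well-formed, non-nested 'p' tags, so it is a starting point rather than a robust extractor.

```python
import html
import re

# Remove <script> and <style> blocks, including their contents.
DROP = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Capture the inner markup of each <p>...</p> pair.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
# Match any remaining tag so it can be stripped from the captured text.
TAG = re.compile(r"<[^>]+>")

def extract_paragraphs(source, min_chars=40):
    """Return the text of every <p> block that is at least min_chars long."""
    source = DROP.sub(" ", source)
    paragraphs = []
    for inner in PARAGRAPH.findall(source):
        # Strip inline tags, decode entities, and collapse whitespace.
        text = html.unescape(TAG.sub("", inner))
        text = re.sub(r"\s+", " ", text).strip()
        # Length is counted in Unicode characters, so multi-byte text
        # is treated the same as ASCII text.
        if len(text) >= min_chars:
            paragraphs.append(text)
    return paragraphs
```

The source must already be decoded to a Unicode string before these expressions are applied. In PHP, the closest equivalent would be the `preg_*` functions with the `u` modifier on UTF-8 input.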
A method without regular expressions to extract only the articles from HTML and PHP pages: PHP code that extracts only the article from the HTML or PHP source in a simple way, without using regular expressions. It also needs to check whether a portion is article content or not.
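For the approach without regular expressions, one possible shape is an event-based HTML parser that collects paragraph text and discards blocks that are too short or consist mostly of link text. The sketch below uses Python's standard `html.parser` module as a stand-in for PHP. The class name `ParagraphCollector`, the function `extract_article`, and both thresholds are assumptions for illustration only.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> and how much of it is link text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished (text, link_chars) pairs
        self._buf = None       # pieces of the <p> being read, or None
        self._link_chars = 0   # characters seen inside <a> in this <p>
        self._in_link = 0      # nesting depth of <a>
        self._skip = 0         # nesting depth of <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends where the next begins
            self._buf = []
            self._link_chars = 0
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skip or self._buf is None:
            return
        self._buf.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())

    def _flush(self):
        if self._buf is not None:
            text = " ".join("".join(self._buf).split())
            self.paragraphs.append((text, self._link_chars))
        self._buf = None

    def close(self):
        super().close()
        self._flush()          # keep a final <p> that was never closed

def extract_article(source, min_chars=40, max_link_ratio=0.5):
    """Return paragraphs that are long enough and not mostly links."""
    parser = ParagraphCollector()
    parser.feed(source)
    parser.close()
    return [
        text for text, link_chars in parser.paragraphs
        if text and len(text) >= min_chars
        and link_chars / len(text) <= max_link_ratio
    ]
```

Neither check looks at id or class names, so the result does not depend on the language the page author used for them. The thresholds would still need tuning per site.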
Both approaches must be free of language and character set limitations (multi-byte text, for example), and simple enough to fit on a single page.