I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) for a suitable method. I tried html2text, BeautifulSoup, NLTK, and a few manual approaches, but each one failed. For example:
- html2text works on offline (i.e. saved) pages, but I need to process the page online.
- BS4 does not handle Unicode properly (my page is Persian text encoded in UTF-8) and fails to extract the text. It also returns HTML tags and code, whereas I only need the rendered text.
- NLTK does not work on my Persian text. Even opening my page with urllib.request.urlopen raises errors. As you can see, I am quite stuck after trying several methods.
My target URL is http://vynylyn.yolasite.com/page2.php and I want to extract only the Persian paragraphs, without tags or code.
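To make the goal concrete, here is a minimal sketch of the kind of extraction I am after, using only the Python 3 standard library. It assumes the page is served as UTF-8 and that the text I want sits inside <p> elements; I have not verified either assumption against the page. The names ParagraphExtractor, extract_paragraphs, and fetch_html are my own illustrative helpers, not part of any library.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphExtractor(HTMLParser):
    """Collects the visible text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs, in document order
        self._buffer = []      # text pieces of the paragraph being read
        self._in_p = False     # True while inside a <p>
        self._skip = 0         # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._in_p = True
        elif tag == "br" and self._in_p:
            self._buffer.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Keep text only when inside a paragraph and outside script/style.
        if self._in_p and not self._skip:
            self._buffer.append(data)

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def close(self):
        super().close()
        self._flush()          # keep a trailing paragraph that was never closed


def extract_paragraphs(html):
    """Return the text of each <p> element in an HTML string, without tags."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


def fetch_html(url, encoding="utf-8"):
    """Download a page and decode it; the UTF-8 default is an assumption."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors="replace")
```

The intended usage would be something like extract_paragraphs(fetch_html(url)), which should give a list of plain Unicode strings that could then be passed on to a tokenizer.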
(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word and sentence tokenization, etc. on it.)
What are my options to get this working?