Text Extracting: Used All Methods, Yet Stuck

Question

I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) for a proper method. I tried html2text, BeautifulSoup, NLTK, and some manual methods, and each of them failed. For example:

html2text works only offline (on saved pages), and I need to do it online.
BS4 doesn't work properly with Unicode (my page is Persian text encoded in UTF-8) and won't extract the text. It also returns HTML tags and code; I only need the rendered text.
NLTK won't work on my Persian text. Even when trying to open my page with urllib.request.urlopen I encounter errors. So, as you can see, I'm stuck after trying several methods.

Here's my target URL: http://vynylyn.yolasite.com/page2.php I want to extract only the Persian paragraphs, without tags or code.

(Note: I use Eclipse Kepler with Python 3.4. After extracting the text, I want to do POS tagging, word/sentence tokenizing, etc. on it.)

What are my options to get this working?

If a library only works "offline," that's because it doesn't have any network capabilities of its own. That's OK, though, because HTML and HTTP are entirely separate technologies. Use a network library (like the one Python includes) to [download your page from the Internet](http://stackoverflow.com/q/22676/33732), and then use your HTML library to process it. – Rob Kennedy, Jan 16 '15 at 20:24
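The split described in this comment (download first, then parse) can be sketched with the standard library alone. This is a minimal illustration, not the method any answer here uses: `TextCollector`, `html_to_text`, and `fetch_html` are hypothetical names, the sample HTML is made up, and the UTF-8 default is an assumption based on the question.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextCollector(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Parsing step: works on any HTML string, however it was obtained."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_html(url, encoding="utf-8"):
    """Network step: download the raw bytes and decode them explicitly."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# Offline demonstration on a made-up page; for a live page one would call
# html_to_text(fetch_html(some_url)) instead.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><p>سلام دنیا</p><script>var x = 1;</script></body></html>"
)
text = html_to_text(sample)
print(text)
```

Because the parsing function only ever sees a string, the same code handles a saved file and a freshly downloaded page.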

declension · Accepted Answer · 2015-12-28T22:43:12.343

1

I'd go with your second option first. BeautifulSoup 4 should (and definitely does) support Unicode (note that it's UTF-8, a global character encoding, so there's nothing Persian about it).

And yes, you will get tags, as it's an HTML page. Try searching for a unique ID, or look at the HTML structure of the page(s). For your example, look for element main and then the content elements below that, or maybe use `div#I1_sys_txt` on that specific page. Once you have your element, you just need to call `get_text()`.

Try this (now in Python 3):

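#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

# Download the page; requests handles the network side.
response = requests.get('http://vynylyn.yolasite.com/page2.php')
response.raise_for_status()  # stop early on an HTTP error

# Pass the raw bytes so BeautifulSoup can detect the encoding itself,
# and name the parser explicitly to avoid the "no parser specified" warning.
soup = BeautifulSoup(response.content, 'html.parser')

# Look up the div mentioned above and print its text without any tags.
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")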
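
To check without a network connection that BeautifulSoup copes with Persian text, the same lookup can be run on an inline string. This is only a sketch: the HTML below is made up for illustration, and only the `I1_sys_txt` id comes from the answer.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page (an assumption, not its actual markup).
html = """
<html><body>
  <div id="nav"><a href="/">Home</a></div>
  <div id="I1_sys_txt">
    <p>این یک متن <b>فارسی</b> است.</p>
    <p>پاراگراف دوم.</p>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
tag = soup.find("div", id="I1_sys_txt")

# One string per paragraph; get_text() drops the nested <b> tag but keeps its text.
paragraphs = [p.get_text().strip() for p in tag.find_all("p")] if tag else []
for paragraph in paragraphs:
    print(paragraph)
```

Keeping one string per paragraph may be convenient for later sentence or word tokenizing, since each paragraph can be processed separately.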

edited Dec 28 '15 at 22:43

answered Jan 16 '15 at 19:23

declension


Thanks Nick. It did solve the case. It returned the whole Persian text without any damn tags, just pure text. I can't vote it up now since that requires 15 reputation, so I just marked it as accepted. – Vynylyn Jan 17 '15 at 05:55
It needs a small modification to work with Python 3.4: `if tag: print(tag.get_text())` followed by `else: print('None found')`. Your answer was okay; I just rewrote that part for people who may come to this question later. Thanks for the help! – Vynylyn Jan 17 '15 at 05:56
Glad it worked (yes that was python2 syntax)! So yes, you could also just use `print(tag.get_text() if tag else "")` for Python 3.x – declension Jan 18 '15 at 11:06
