Reading source page saved in a text file and extracting text

Question

I have multiple text files that store source pages from a website, so each text file is one source page.

I need to extract the text of a div with a particular class from the text file. I am using the following code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))
txt = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text
print txt

I have checked the type of my soup object to make sure it is not using the string find method while looking for the div class. Type of the soup object:

print type(soup)
<class 'bs4.BeautifulSoup'>

I have already consulted one of the previous posts and, as suggested there, put the open() call inside the BeautifulSoup() call.

Error:

Traceback (most recent call last):
  File "html_desc_cleaning.py", line 13, in <module>
    txt2 = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text
AttributeError: 'NoneType' object has no attribute 'text'

Source from the page:

Don't upload image add the text because the image is not useful — styvane, Oct 14 '15 at 06:10

Remi Guan · Answer 1 · 2015-10-14T06:17:07.013

8

Try replacing this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))

with this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt").read())

By the way, it is a good idea to close the file after reading it. You can use a with statement like this:

with open("zing.internet.accelerator.plus.txt") as f:
    soup = BeautifulSoup(f.read())

The with statement closes the file automatically.

Here is an example showing what .read() does: open() returns a file object, and .read() returns its contents as a string:

>>> a = open('test.txt')
>>> type(a)
<class '_io.TextIOWrapper'>

>>> print(a)
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

>>> b = a.read()
>>> type(b)
<class 'str'>

>>> print(b)
Hey there.

>>> print(open('test.txt'))
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

>>> print(open('test.txt').read())
Hey there.

edited Oct 14 '15 at 06:17

answered Oct 14 '15 at 06:11


Hey Thanks. I tried the above code and included read also but still getting the same error :( – Pappu Jha Oct 14 '15 at 06:20
Hmm...try `print open("zing.internet.accelerator.plus.txt").read()` – Remi Guan Oct 14 '15 at 06:21
It is printing the whole source page – Pappu Jha Oct 14 '15 at 06:23
Good start. What about `txt = soup.find_all('div')`? Try using it instead of `txt = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text`. – Remi Guan Oct 14 '15 at 06:24
It has printed many div classes but not the one I am looking for. – Pappu Jha Oct 14 '15 at 06:29
So the `open()` function is working now. Try `txt = soup.find('div', {'class': 'id-app-orig-desc'}).text`. – Remi Guan Oct 14 '15 at 06:31

score 2 · Accepted Answer · answered Oct 14 '15 at 14:04

I have solved the problem.

The default parser for BeautifulSoup in my case was 'lxml', which was not able to read the complete source page.

Changing the parser to 'html.parser' worked for me.

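from bs4 import BeautifulSoup

# Read the saved source page and close the file when done
with open("zing.internet.accelerator.plus.txt") as f:
    html = f.read()
# Name the parser explicitly instead of relying on the default
bs = BeautifulSoup(html, "html.parser")
print(bs.find('div', {'class': 'id-app-orig-desc'}).text)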
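
Since find() returns None when nothing matches (which is what produced the AttributeError in the question), it may help to check the result before touching .text. Below is a minimal sketch of that idea, written for Python 3. The helper names extract_div_text and extract_from_file, the sample markup, and the utf-8 encoding are my own assumptions for illustration, not something from the original pages.

```python
from bs4 import BeautifulSoup


def extract_div_text(html, css_class, parser="html.parser"):
    """Return the text of the first <div> with the given class, or None."""
    soup = BeautifulSoup(html, parser)
    div = soup.find("div", {"class": css_class})
    # find() returns None when no element matches, so check before using .text
    if div is None:
        return None
    return div.text


def extract_from_file(path, css_class, parser="html.parser"):
    """Read a saved source page from disk and extract the div text."""
    # Assumption: the saved pages are UTF-8; adjust the encoding if they are not.
    with open(path, encoding="utf-8") as f:
        return extract_div_text(f.read(), css_class, parser)


# Hypothetical sample markup standing in for a saved source page.
sample = '<html><body><div class="id-app-orig-desc">Hello</div></body></html>'
print(extract_div_text(sample, "id-app-orig-desc"))
print(extract_div_text(sample, "no-such-class"))
```

With several saved pages, extract_from_file could be called once per file name, and a None result then tells you which pages lack the div (or which ones the chosen parser could not read completely) instead of raising an exception.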
