I have multiple text files that store the source of pages from a website, one source page per text file.

I need to extract the text of a div with a particular class from each file, using the following code:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))
txt = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text
print txt

I checked the type of my soup object to make sure that find here is the BeautifulSoup method and not the string find method. The type of the soup object:

print type(soup)
<class 'bs4.BeautifulSoup'>

Following one of the previous posts, I have already put the open() call directly inside the BeautifulSoup() call.

Error:

Traceback (most recent call last):
  File "html_desc_cleaning.py", line 13, in <module>
    txt2 = soup.find('div' , attrs = { 'class' : 'id-app-orig-desc' }).text
AttributeError: 'NoneType' object has no attribute 'text'
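
This AttributeError means that soup.find(...) returned None, i.e. no matching div was found in the parsed tree. One way to narrow this down is to check whether the class name appears in the raw text of each saved file at all. The following is a minimal Python 3 sketch of that check; the helper name files_missing_marker and the two sample files are made up for illustration and are not part of the original script.

```python
import os
import tempfile

def files_missing_marker(paths, marker):
    """Return the paths whose raw text does not contain `marker`."""
    missing = []
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            if marker not in f.read():
                missing.append(path)
    return missing

# Two throwaway files stand in for saved source pages (hypothetical content).
tmp_dir = tempfile.mkdtemp()
samples = {
    "with_div.txt": '<div class="id-app-orig-desc">Some description</div>',
    "without_div.txt": "<div class='other'>Nothing relevant</div>",
}
paths = []
for name, html in samples.items():
    path = os.path.join(tmp_dir, name)
    with open(path, "w", encoding="utf-8") as out:
        out.write(html)
    paths.append(path)

missing = files_missing_marker(paths, "id-app-orig-desc")
print([os.path.basename(p) for p in missing])
```

If a file is reported as missing the marker, the div is not in the saved source at all; if the marker is present and find still returns None, the parsing step is the more likely culprit.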

Source from the page: [screenshot of the page source, not reproduced here]

Pappu Jha

2 Answers

Try replacing this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt"))

with this:

soup = BeautifulSoup(open("zing.internet.accelerator.plus.txt").read())

By the way, it is a good idea to close the file after reading it. You can use a with statement like this:

with open("zing.internet.accelerator.plus.txt") as f:
    soup = BeautifulSoup(f.read())

The with statement closes the file automatically when the block ends.
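
As a quick check of that behaviour, here is a small Python 3 sketch that inspects the file object's closed attribute inside and after a with block. The temporary file and its content are stand-ins for a saved source page, not the real file from the question.

```python
import os
import tempfile

# Hypothetical scratch file standing in for one of the saved source pages.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with open(path, "w") as out:
    out.write("<div class='id-app-orig-desc'>Hey there.</div>")

with open(path) as f:
    html = f.read()
    closed_inside = f.closed   # still open while inside the block

closed_after = f.closed        # closed once the block has been left
os.remove(path)
print(closed_inside, closed_after)
```

The string read inside the block stays usable afterwards; only the file handle is released.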


Here is an example showing what .read() does: open() returns a file object, and .read() returns its contents as a string:

>>> a = open('test.txt')
>>> type(a)
<class '_io.TextIOWrapper'>

>>> print(a)
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

>>> b = a.read()
>>> type(b)
<class 'str'>

>>> print(b)
Hey there.

>>> print(open('test.txt'))
<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

>>> print(open('test.txt').read())
Hey there.
Remi Guan

I have solved the problem.

In my case BeautifulSoup defaulted to the 'lxml' parser, which was not able to read the complete source page.

Changing the parser to 'html.parser' worked for me.

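from bs4 import BeautifulSoup

# Read the saved page, then parse it with the built-in html.parser
with open("zing.internet.accelerator.plus.txt") as f:
    html = f.read()
bs = BeautifulSoup(html, "html.parser")
print(bs.find('div', {'class': 'id-app-orig-desc'}).text)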
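
For reference, the 'html.parser' option is backed by the html.parser module from Python's standard library. The sketch below is not BeautifulSoup and not the code used above; it is a rough Python 3 illustration, with made-up names (DivTextExtractor, div_text) and a made-up sample string, of pulling the text of the first div with a given class using only that module. It returns None when no such div exists instead of raising.

```python
from html.parser import HTMLParser

class DivTextExtractor(HTMLParser):
    """Collects the text inside the first <div> carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0       # <div> nesting depth, counted from the target div
        self.found = False   # set once the target div has been opened
        self.done = False    # set once the target div has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth:
            self.depth += 1  # a nested div inside the target
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.found = True

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)

def div_text(html, css_class):
    """Return the text of the first matching div, or None if there is none."""
    parser = DivTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts) if parser.found else None

# Hypothetical sample markup, not taken from the real page.
sample = ('<div class="wrap"><div class="id-app-orig-desc">Fast '
          '<b>downloads</b> <div>for all</div></div>footer</div>')
print(div_text(sample, "id-app-orig-desc"))
print(div_text(sample, "no-such-class"))
```

Returning None explicitly makes the "div not found" case easy to test for before touching the text.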
Pappu Jha