
I am currently using BeautifulSoup's findAll function to extract the desired parts of a web page. However, it fails to get all of them and returns None for some. My Python code looks like this:

from bs4 import BeautifulSoup
import urllib

url = 'http://code.google.com/p/android/issues/detail?id=1060&colspec=ID Type Status Owner Summary Stars Opened Closed Modified Reporter Cc Project Reportedby Priority Version Target Milestone Component MergedInto BlockedOn Blocking Blocked Subcomponent Attachments'
issue_page = urllib.urlopen(url).read()

soup = BeautifulSoup(issue_page)
comment_parts = soup.findAll(name='div', attrs={'class': 'cursor_off vt issuecomment'})
for comment_part in comment_parts:
    print str(comment_part) + '\n'

It only gets the first 48 comments; the 49th and subsequent ones are not returned. I viewed the source code of the corresponding HTML page, and the 49th is formatted just like the 48th and the earlier ones. I really cannot figure out why this happens! Can anybody help me out? Thanks a lot!

terry

1 Answer


When I execute your code, I get 58 results.

... Your code ...
print len(comment_parts)

... and,

print comment_parts[-1]

prints the last item on the page. Are you getting something different?
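
For reference, here is a minimal diagnostic sketch under the same assumptions as the question (Python 2 and the bs4 package; the shortened URL is just for the example) that prints the bs4 version alongside the result count, so we can compare environments:

from bs4 import BeautifulSoup
import bs4
import urllib

# Shortened URL for the example; the full query string from the question works the same way.
url = 'http://code.google.com/p/android/issues/detail?id=1060'
issue_page = urllib.urlopen(url).read()

soup = BeautifulSoup(issue_page)
comment_parts = soup.findAll('div', attrs={'class': 'cursor_off vt issuecomment'})

# Print the library version and how many comment divs were matched.
print 'bs4 version:', bs4.__version__
print 'comments found:', len(comment_parts)
print 'last comment:'
print comment_parts[-1]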

jbiz
  • Thanks very much for your quick reply. I made a mistake in the question and I have edited it just now. Actually, I only got 48 results, and there should be about 10 more results returned. The output of `comment_parts[-1]` is "…". In addition, I ran the experiment on Ubuntu 13.04 with Python 2.7.
    – terry Sep 10 '13 at 05:27
  • It seems that this problem is related to the version of BeautifulSoup. When using version **4.3.1** the problem happens, but when I changed to version "3.2.1" it works fine! – terry Sep 10 '13 at 07:26
  • I just ran your code using bs4 4.3.1 and got the same results as before ... that is, 58 results, with the last one being the final comment. Are you able to experiment with virtual environments? Which version of Python are you using? – jbiz Sep 10 '13 at 15:16
  • I am experimenting with Python 2.7 on Ubuntu 13.04, and it fails to get all the results with ``bs4`` but works correctly with ``BeautifulSoup 3.2.1``. Just now, I tried the code with Python 2.7 on Windows 7 with ``bs4`` and it works correctly. It is really odd. – terry Sep 11 '13 at 04:56
  • Thanks very much for your help. I found the answer here: [BeautifulSoup findAll does not find them all](http://stackoverflow.com/questions/16322862/beautiful-soup-findall-doent-find-them-all). The problem is related to the HTML parser used. I had installed ``lxml``, and ``BeautifulSoup`` uses it by default, but it does not handle broken HTML very well. I set the parser to ``html.parser`` like this: ``soup = BeautifulSoup(issue_page, 'html.parser')`` and now it works! See the sketch below. – terry Sep 11 '13 at 05:16
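
Following up on terry's last comment, here is a short sketch of that fix under the same assumptions (Python 2, bs4, and a shortened URL for illustration): pass the parser name explicitly so BeautifulSoup uses the standard-library parser instead of picking up ``lxml``.

from bs4 import BeautifulSoup
import urllib

# Shortened URL for the example.
url = 'http://code.google.com/p/android/issues/detail?id=1060'
issue_page = urllib.urlopen(url).read()

# Explicitly request the standard-library parser instead of letting bs4
# default to lxml, which (per the linked question) can drop parts of broken HTML.
soup = BeautifulSoup(issue_page, 'html.parser')

comment_parts = soup.findAll('div', attrs={'class': 'cursor_off vt issuecomment'})
print 'comments found:', len(comment_parts)
for comment_part in comment_parts:
    print str(comment_part) + '\n'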