I am currently using BeautifulSoup's findAll function to extract certain elements from a web page. However, it fails to find all of them and returns None for some parts. My Python code looks like this:
from bs4 import BeautifulSoup
import urllib
url = 'http://code.google.com/p/android/issues/detail?id=1060&colspec=ID Type Status Owner Summary Stars Opened Closed Modified Reporter Cc Project Reportedby Priority Version Target Milestone Component MergedInto BlockedOn Blocking Blocked Subcomponent Attachments'
issue_page = urllib.urlopen(url).read()
soup = BeautifulSoup(issue_page)
comment_parts = soup.findAll(name='div', attrs={'class': 'cursor_off vt issuecomment'})
for comment_part in comment_parts:
    print str(comment_part) + '\n'
It only returns the first 48 comments; the 49th and subsequent ones are missing. I looked at the HTML source of the page, and the 49th comment is marked up exactly like the 48th and the ones before it. I can't figure out why this happens. Can anybody help me out? Thanks a lot!
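
One way to narrow this down might be to count the matching div tags without going through BeautifulSoup at all and compare that number with len(comment_parts). The sketch below does this with the standard library's html.parser. It is written for Python 3 (unlike the Python 2 snippet above), the names CommentDivCounter and count_comment_divs are made up for illustration, and the sample string stands in for the downloaded page. Note that it matches any div carrying all three classes, which is slightly looser than the exact class string used in the findAll call.

```python
from html.parser import HTMLParser

# Classes used by the comment blocks in the question's findAll call.
TARGET_CLASSES = {"cursor_off", "vt", "issuecomment"}


class CommentDivCounter(HTMLParser):
    """Count <div> start tags that carry all of the target classes."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = set((dict(attrs).get("class") or "").split())
        if TARGET_CLASSES <= classes:
            self.count += 1


def count_comment_divs(html_text):
    """Return how many comment divs appear in the raw HTML text."""
    counter = CommentDivCounter()
    counter.feed(html_text)
    counter.close()
    return counter.count


# Stand-in for the real page source (assumption: the real input would be
# the decoded text of the downloaded issue page).
sample = (
    '<div class="cursor_off vt issuecomment">first</div>'
    '<div class="cursor_off vt issuecomment">second</div>'
    '<div class="other">ignored</div>'
)
print(count_comment_divs(sample))
```

If this count is higher than the number of elements findAll returns for the same page, that would suggest the parser BeautifulSoup is using builds a different tree from this markup, rather than the 49th comment being marked up differently.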